A Data Scientist's Guide to the S&P 500

Stefan Obradovic

Published Online: December 21st, 2020, University of Maryland, College Park


Introduction

In 2008, Warren Buffett, the billionaire investor and CEO of Berkshire Hathaway[1], made a \$1 million bet against the hedge fund industry. His bet was that, after accounting for fees, the S&P 500 index would outperform a professionally curated hedge fund portfolio over 10 years. Protégé Partners LLC, an asset management firm based in Manhattan, New York, took up the bet and 10 years later named Buffett the winner. In fact, Buffett won by a staggering margin: over the 10-year period the S&P 500 gained an average of 7.1% per year, compared with 2.2% per year for Protégé [2].
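
To put that margin of victory in perspective, the annualized figures can be compounded over the full decade. A quick back-of-the-envelope calculation, using only the two rates quoted above:

```python
# Compound the quoted annualized returns over the 10-year bet
sp500_total = (1 + 0.071) ** 10 - 1  # S&P 500: 7.1% per year
hedge_total = (1 + 0.022) ** 10 - 1  # Protégé's funds: 2.2% per year

print(f"S&P 500 cumulative return: {sp500_total:.1%}")     # about +99%
print(f"Hedge fund cumulative return: {hedge_total:.1%}")  # about +24%
```

Compounding turns a 4.9-point annual gap into a cumulative gap of roughly 75 percentage points.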

So what is the S&P500?

The S&P 500 is a stock market index that measures the performance of roughly 500 large American companies [3]. As of the date of publication, over \$11.2 trillion (USD) is invested in the S&P 500, representing about 80% of the US stock market's value [4]. As such, the performance of the S&P 500 index is widely used as a rough measure of the overall performance of the market[5]. In fact, the S&P 500 and similar index funds have been widely regarded by investment moguls, including Warren Buffett himself, as some of the easiest, safest, and most consistent investments[6]. Simply put, this is because index funds like the S&P 500 have a couple of very desirable qualities:

1) It is incredibly easy to invest in the S&P500.

The easiest way to invest in the S&P 500 is to buy an exchange-traded fund (ETF) that replicates the performance of the index by holding the same stocks in the same proportions. Exchange-traded funds, or ETFs, are a type of investment fund traded on public stock exchanges like the New York Stock Exchange. The amount that you invest in an ETF is divided among all of the ETF's holdings, and as a shareholder of the ETF you are entitled to a share of the profits[7]. Examples of ETFs that mimic the holdings of the S&P 500 index include, but are not limited to, the SPDR S&P 500 ETF Trust (SPY), the iShares S&P 500 Growth ETF (IVW), the iShares Core S&P 500 ETF (IVV), the Vanguard 500 Index Fund ETF (VOO), the SPDR Portfolio S&P 500 ETF (SPLG), and the Schwab US Large-Cap ETF (SCHX). Shares of each of these ETFs can be purchased on any electronic trading platform, like Robinhood, or through a stockbroker such as Fidelity, TD Ameritrade, or Charles Schwab.

2) The S&P500 is very diverse.

In the stock market as a whole, stocks and industries will not always perform well. However, if you invest in different "types" of stocks, it is very unlikely that all of them will perform poorly at the same time. By diversifying your portfolio across multiple sectors, market capitalizations, countries of origin, etc., you can avoid severe declines in the value of your investments. Diversification matters because companies in the stock market are highly interdependent, or respond to the same economic forces in similar ways[8]. For instance, if the car manufacturer Toyota were to announce a sudden recall of over 2 million vehicles, demand for Ford vehicles would likely rise, increasing Ford's stock price. A health care company might remain unaffected, but this sort of announcement would often ripple throughout the market: American Airlines might see its stock price increase as more people opt for air travel, and the raw materials manufacturer Steel Dynamics might benefit as Toyota looks to buy excess steel. Overall, the S&P 500 is composed of over 500 companies spanning Communication Services, Consumer Discretionary, Consumer Staples, Energy, Financials, Health Care, Industrials, Information Technology, Materials, Real Estate, and Utilities[9]. As such, it offers the broad diversification that is extremely desirable to portfolio managers.
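
The risk-reduction effect of diversification can be quantified with the standard two-asset portfolio-variance formula. A minimal sketch with invented numbers (the volatilities, weights, and correlations below are illustrative, not drawn from the index):

```python
import numpy as np

# Two hypothetical stocks, each with 20% annual volatility, held 50/50
sigma_a, sigma_b = 0.20, 0.20
w_a, w_b = 0.5, 0.5

def portfolio_vol(corr):
    """Volatility of the two-asset portfolio for a given correlation."""
    var = ((w_a * sigma_a) ** 2 + (w_b * sigma_b) ** 2
           + 2 * w_a * w_b * sigma_a * sigma_b * corr)
    return np.sqrt(var)

print(portfolio_vol(1.0))  # perfectly correlated: 0.20, no risk reduction
print(portfolio_vol(0.2))  # weakly correlated: ~0.155, meaningful reduction
```

The lower the correlation between holdings, the lower the portfolio's overall volatility, which is exactly why holding 500 companies across 11 sectors dampens swings that would hit any single sector hard.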

3) In the long-term, the S&P500 has large, positive returns.

At the end of the day, what really matters for investment portfolios is positive returns; no one wants to lose money on their investment. Since its inception in 1926, the S&P 500 has grown approximately 9.8% per year on average. Moreover, the index has posted an annual gain 70% of the time and an annual loss only 30% of the time[10]. For long-term investors, then, the S&P 500 has historically been a very reliable source of positive returns.
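
As a rough illustration of how a 9.8% average annual return compounds over a long horizon (the principal and horizon below are arbitrary choices, not recommendations):

```python
annual_return = 0.098   # historical average annual growth cited above
principal = 10_000      # arbitrary starting investment
years = 30              # arbitrary long-term horizon

final_value = principal * (1 + annual_return) ** years
print(f"${final_value:,.0f}")  # roughly $165,000
```

A one-time investment grows more than sixteen-fold over 30 years at that rate, which is the essence of the long-term case for the index.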

In summary, the goal of an investment portfolio is to spread the investment across a number of assets to minimize risk. Through diversification, we can pick a range of assets that minimize risk while maximizing potential returns.

So can we beat the S&P500?


Project Motivation

1) If we can better understand the composition of the S&P500, how each of these companies interact with each other, and what factors affect their stock prices, we can gain a deeper understanding of the benefits and pitfalls of the index.

2) We can potentially make a "better" S&P500 by subsetting only those stocks in the index that are the highest performers and minimally correlated with each other to further improve returns while retaining adequate portfolio diversification.

3) If we can predict future stock prices based on historical stock prices, we can buy stocks that we predict will perform well rather than holding all of the stocks in the S&P500.


Getting Started:

The following are required imports and prerequisite packages for the project:

In [1]:
import pandas as pd
import numpy as np
import datetime
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (15, 9)
%matplotlib inline
import seaborn as sbn
import os
os.environ['KMP_DUPLICATE_LIB_OK']='True'


#Misc.
import psycopg2
from typing import Union, Tuple, List, Dict, Any, Optional
from pmdarima.arima import ADFTest
from sklearn.metrics import mean_squared_error
from fastdtw import fastdtw
import pmdarima as pm
from statsmodels.tsa.arima_model import ARIMA
import statsmodels.api as sm
import dtw

#Scipy - Clustering and Statistics
import scipy
import scipy.cluster.hierarchy as sch
from scipy import stats
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist, squareform, euclidean

#NetworkX - Graphs
import networkx as nx

#RESTful APIs and Scraping
import requests
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
import bs4 as bs

# NLTK VADER for sentiment analysis
from nltk.sentiment.vader import SentimentIntensityAnalyzer

import warnings
warnings.filterwarnings("ignore")
Importing the dtw module. When using in academic works please cite:
  T. Giorgino. Computing and Visualizing Dynamic Time Warping Alignments in R: The dtw Package.
  J. Stat. Soft., doi:10.18637/jss.v031.i07.

In [2]:
array_1d = Union[List, Tuple, pd.Series, np.ndarray]
date_obj = Union[datetime.datetime, datetime.date]

Part 1: Data Collection and Processing

First, we scrape Wikipedia to get a list of companies in the S&P500, their corresponding tickers on the New York Stock Exchange, and the Sector that the company falls under based on the Global Industry Classification Standard (GICS).

In [3]:
#Send a GET request to the Wikipedia API
resp = requests.get('http://en.wikipedia.org/wiki/List_of_S%26P_500_companies') #Returns the html of the Wikipedia page for S&P500 companies
soup = bs.BeautifulSoup(resp.text, 'lxml') #Parse the webpage from Wikipedia with BeautifulSoup

table = soup.find('table', {'class': 'wikitable sortable'}) #Find the contents under the table tag

tickers = [] #Stores tickers
names = [] #Stores company names
industries = [] #Stores GICS Industry

for row in table.findAll('tr')[1:]: #Iterates through each row in the table excluding the header and extracts relevant data
    tickers.append(row.findAll('td')[0].text)
    names.append(row.findAll('td')[1].text)
    industries.append(row.findAll('td')[3].text)
    
tickers = [s.strip() for s in tickers] #Strip newline characters from parsed data
names = [s.strip() for s in names]
industries = [s.strip() for s in industries]
stock_info = pd.DataFrame({'Ticker': tickers, 'Name': names, 'Sector': industries}) #Create DataFrame from Wikipedia data

stock_info = stock_info.sort_values(by=['Ticker'], ignore_index = True) #Sort alphabetically by ticker

stock_info.head()
print(f' Number of stocks in stock_info: {stock_info.shape[0]}')
 Number of stocks in stock_info: 505

We set the time window for this analysis to April 1, 2019 through December 1, 2019.

In [4]:
start_date = datetime.datetime(2019, 4, 1)
end_date = datetime.datetime(2019, 12, 1)

Next, we will establish a connection to a PostgreSQL Database.

The database we will be using to query data is owned by the Smith Investment Fund, a student investment group at the University of Maryland. This database houses both equity and fundamental data purchased from Quandl, a premier source for financial and investment data.

The Quandl Equity Data (linked here) is updated daily and offers end of day prices, dividends, adjustments, and stock splits for US publicly traded stocks since 1996.

In [5]:
#The database credentials are stored as environment variables rather than hard-coded in the notebook
CONNECTION_PARAMS = {
    'database': os.environ.get("DATABASE"),
    'user': os.environ.get("USERNAME"),
    'password': os.environ.get("PASSWORD"),
    'host': os.environ.get("HOST"),
    'port': os.environ.get("PORT"),
}

The following functions are used to form and execute queries on the database:

In [6]:
def execute_query(
    query: str, params: array_1d = None, commit: bool = False
) -> Tuple:
    """Connects to the PostgreSQL database and executes a specified query.

    This query should be specified as a string and can contain placeholders.
    If placeholders are used, the second parameter should be a tuple of
    values to substitute into the query string.

    See the Psycopg2 documentation for more info:
    https://www.psycopg.org/docs/usage.html

    If the database will be modified by the query (i.e. INSERT) the third
    commit parameter should be set to True.

    Args:
        query: The entire SQL query (can contain placeholders)
        params: Values for each placeholder in the query
        commit: Whether to commit the changes to the database

    Returns:
        A tuple pair consisting of a list of column names and the results of
        the query. The results are a list of tuples, each corresponding to a
        table row. If the request fails, returns None.
    """

    col_names = None
    results = None
    with psycopg2.connect(**CONNECTION_PARAMS) as conn:
        cur = conn.cursor()
        cur.execute(query, params)  # Execute query

        # Get column names from response
        col_names = [desc.name for desc in cur.description]

        results = cur.fetchall()  # Get queried data

        if commit:
            conn.commit()

        cur.close()

    if results:
        return col_names, results

    return None
In [7]:
def get_data(table_name: str, tickers: Optional[Union[str, array_1d]] = None,
             start_date: Optional[date_obj] = None,
             end_date: Optional[date_obj] = None,
             attributes: Optional[array_1d] = None
             ) -> pd.DataFrame:
    """Gets data from either the equity or fundamental table.

    Filtering by ticker, attributes (e.g. close price for equity, or p/e for
    fundamental), or start/end dates are optional. If the ticker is a string,
    only data for that ticker will be queried. If the ticker is a list/tuple
    of strings, data for all the tickers will be queried. If no arguments are
    specified, the entire equity table is returned.

    Args:
        table_name: Gets data from either the equity or fundamental table
        tickers: Gets data for only this ticker or list of tickers.
        attributes: Gets data for columns corresponding to list of attributes
        start_date: Gets data from after this date (inclusive)
        end_date: Gets data from before this date (inclusive)
    Returns:
        A dataframe of the requested table data.
        If the database query fails, returns None.
    """

    # Keeps track of SQL WHERE conditions
    ticker_conds = []
    date_conds = []
    # Keeps track of placeholder values
    query_params = []
    # date column name
    date_col_name = (
        'date'
        if (table_name == 'equity' or table_name == 'equity_raw')
        else 'datekey'
    )

    # Add WHERE conditions for each ticker
    if tickers:
        if isinstance(tickers, (list, tuple)):
            for t in tickers:
                ticker_conds.append("ticker = %s")
                query_params.append(t)
        else:
            ticker_conds.append("ticker = %s")
            query_params.append(tickers)

    # Add WHERE conditions for the start and end date
    if start_date:
        date_conds.append(date_col_name + " >= %s")
        query_params.append(start_date)
    if end_date:
        date_conds.append(date_col_name + " <= %s")
        query_params.append(end_date)

    # Build the query and execute it
    ticker_query = " OR ".join(ticker_conds)
    date_query = " AND ".join(date_conds)

    # Build attributes string
    # TODO: Currently attributes can only be a list. Add functionality for sets,
    #  tuples etc. https://gitlab.com/sif_quant_team/siftools/-/issues/1
    if not attributes:
        attr_string = '*'
    else:
        if date_col_name not in attributes:
            attributes = [date_col_name] + attributes
        if 'ticker' not in attributes:
            attributes = ['ticker'] + attributes
        attr_string = ','.join(attributes)

    # Construct WHERE clause for query string
    query = f"SELECT {attr_string} FROM {table_name}"
    if ticker_query and date_query:
        query += f" WHERE ({ticker_query}) AND {date_query}"
    elif ticker_query:
        query += f" WHERE {ticker_query}"
    elif date_query:
        query += f" WHERE {date_query}"

    # if table is fundmental_raw, we also need to filter on dimension
    if table_name == 'fundamental_raw':
        if not (ticker_query or date_query):
            query += " WHERE dimension = \'ARQ\'"
        else:
            query += " AND dimension = \'ARQ\'"
    query += ";"

    cols, response = execute_query(query, params=query_params)

    # If the query was successful, convert and return
    if response:
        df = pd.DataFrame(data=response, columns=cols)
        df.drop_duplicates(
            subset=['ticker', date_col_name], keep='first', inplace=True
        )
        return df

    return None
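
Since `get_data` mixes several conditional branches, the WHERE-clause assembly is easier to see in isolation. The following standalone sketch reproduces just that string-building logic (no database access; the `build_query` helper is hypothetical and not part of the project's API):

```python
def build_query(table_name, tickers=None, start_date=None, end_date=None):
    """Sketch of get_data's query assembly, without executing anything."""
    # One placeholder condition per ticker, OR'd together
    ticker_conds = ["ticker = %s" for _ in (tickers or [])]
    date_conds = []
    if start_date:
        date_conds.append("date >= %s")
    if end_date:
        date_conds.append("date <= %s")

    ticker_query = " OR ".join(ticker_conds)
    date_query = " AND ".join(date_conds)

    query = f"SELECT * FROM {table_name}"
    if ticker_query and date_query:
        query += f" WHERE ({ticker_query}) AND {date_query}"
    elif ticker_query or date_query:
        query += f" WHERE {ticker_query or date_query}"
    return query + ";"

print(build_query('equity', ['AAPL', 'MSFT'], '2019-04-01', '2019-12-01'))
# SELECT * FROM equity WHERE (ticker = %s OR ticker = %s) AND date >= %s AND date <= %s;
```

The actual values are passed separately to psycopg2 as `params`, so the placeholders are substituted safely by the driver rather than by string formatting.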
In [8]:
def get_fundamental_data_day(
    tickers: Optional[Union[str, array_1d]] = None,
    date: Optional[date_obj] = None,
    attributes: Optional[array_1d] = None,
    day_threshold: int = 252,
    table_name: str = 'fundamental_raw',
) -> pd.DataFrame:
    """get_fundamental_data, but gets the nearest fundamental data point for
    a single day. This is the same as querying from a table with
    forward-filled values (daily).
    """

    # Keeps track of SQL WHERE conditions
    ticker_conds = []
    # Keeps track of placeholder values
    query_params = []
    # date column name

    # Add WHERE conditions for each ticker
    if tickers:
        if isinstance(tickers, (list, tuple)):
            for t in tickers:
                ticker_conds.append("ticker = %s")
                query_params.append(t)
        else:
            ticker_conds.append("ticker = %s")
            query_params.append(tickers)

    # Build ticker query
    ticker_query = " OR ".join(ticker_conds)
    ticker_query = f"AND ({ticker_query})" if ticker_query else ""

    # Build attributes string
    if not attributes:
        attr_string = '*'
    else:
        if 'ticker' not in attributes:
            attributes = ['ticker'] + attributes
        if 'datekey' not in attributes:
            attributes = ['datekey'] + attributes
        attr_string = ','.join(attributes)

    # Build full query
    query = f"""WITH summary AS (
                    SELECT {attr_string},
                        ROW_NUMBER() OVER(PARTITION BY ticker
                                                ORDER BY datekey DESC) AS rk
                    FROM {table_name}
                    where datekey <= '{date}'
                    {ticker_query}
                    AND dimension = \'ARQ\'
                )
                SELECT s.*
                FROM summary s
                WHERE s.rk = 1
                ;
            """

    cols, response = execute_query(query, params=query_params)

    if response:
        df = pd.DataFrame(data=response, columns=cols)
        df.drop('rk', axis=1, inplace=True)
        date_diff = (date - pd.to_datetime(df['datekey'])).dt.days
        df = df[date_diff <= day_threshold]
        return df

    return None
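
The `ROW_NUMBER() OVER (PARTITION BY ticker ORDER BY datekey DESC)` window function above selects the most recent row per ticker. For readers more comfortable in pandas, a toy equivalent (tickers and values are made up):

```python
import pandas as pd

# Toy fundamental rows: multiple report dates per ticker
df = pd.DataFrame({
    'ticker':  ['AAA', 'AAA', 'BBB'],
    'datekey': pd.to_datetime(['2019-01-01', '2019-03-01', '2019-02-01']),
    'eps':     [1.0, 1.2, 2.0],
})

# Sort newest-first, then keep the first row in each ticker group
latest = (df.sort_values('datekey', ascending=False)
            .groupby('ticker', as_index=False)
            .first())
print(latest)  # one row per ticker, carrying its most recent eps
```

Doing this filtering in SQL, as the function above does, avoids pulling every historical row over the network first.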
In [9]:
def get_fundamental_data(
    tickers: Optional[Union[str, array_1d]] = None,
    start_date: Optional[date_obj] = None,
    end_date: Optional[date_obj] = None,
    attributes: Optional[array_1d] = None,
    raw: bool = True,
) -> Dict[str, pd.DataFrame]:
    """Gets data from fundamental table and returns a dictionary of DataFrame
    keyed by attributes. Each of these DataFrame has tickers as columns and
    rows as dates and values of the attribute for that day

    If the ticker is a string, only data for that ticker will be queried
    If the ticker is array1d of strings, data for all  tickers will be queried
    If no arguments are specified, data from the entire equity table is returned

    Args:
        tickers: Gets data for only this ticker or list of tickers
        attributes: Gets data for columns corresponding to list of attributes
        start_date: Gets data from after this date (inclusive)
        end_date: Gets data from before this date (inclusive)
        raw: If true, data pulled from fundamental_raw table
             Otherwise data pulled from fundamental table
    Returns:
        A Dictionary of DataFrames keyed by attributes
    """
    # Query database for rows
    table_name = 'fundamental_raw' if raw else 'fundamental'

    response_df = get_data(
        table_name, tickers, start_date, end_date, attributes
    )

    # Check if response_df is None and throw error
    if response_df is None:
        raise TypeError("Response from query was None")

    # Convert to datetime
    response_df['datekey'] = pd.to_datetime(response_df['datekey'])

    # Sort on date
    response_df.sort_values(['datekey'], inplace=True)

    # Set date as index
    response_df.set_index('datekey', inplace=True)

    # Get attributes that are not date and ticker columns
    non_date_ticker_attr = response_df.drop(['ticker'], axis=1).columns

    return create_attribute_dictionary(response_df, non_date_ticker_attr)
In [10]:
def get_fundamental_filled(
    tickers: Optional[Union[str, array_1d]] = None,
    start_date: Optional[date_obj] = None,
    end_date: Optional[date_obj] = None,
    attributes: Optional[array_1d] = None,
    day_threshold: int = 365,
    dates: Optional[array_1d] = None,
) -> Dict[str, pd.DataFrame]:
    """Pulls data from fundamental_raw table by first finding the most recent
    data, and then forward filling any missing data

    Args:
        tickers: Gets data for only this ticker or list of tickers
        start_date: Gets data from after this date (inclusive)
        end_date: Gets data from before this date (inclusive)
        attributes: Gets data for columns corresponding to list of attributes
        day_threshold: threshold for recent data. Default is 365 days
                       If data is not found in that range for a given
                       ticker, that ticker's fill data will be replaced with NaN
        dates: Dates to reindex data for. Default is None, no resampling will be
               done. Otherwise, if dates are provided, output will be processed
               by resample_index_dict before returning
    Returns:
        A Dictionary of DataFrames keyed by attributes
    """
    # Get initial clean data
    initial_data = get_fundamental_data_day(
        tickers, start_date, attributes, day_threshold=day_threshold
    )
    # Make tickers columns and attributes as the index
    initial_data.index = initial_data['ticker']
    initial_data.drop('ticker', axis=1, inplace=True)
    initial_data = initial_data.transpose()

    # Get the rest of the data
    fund_data = get_fundamental_data(tickers, start_date, end_date, attributes)

    # If clean data was not found for certain tickers, add NaN in their place
    initial_tickers = set(initial_data.columns)
    fund_tickers = set(fund_data[list(fund_data.keys())[0]].columns)
    missing_tickers = fund_tickers - initial_tickers
    missing_data = np.full(
        (initial_data.shape[0], len(missing_tickers)), np.nan
    )
    missing_df = pd.DataFrame(
        missing_data, columns=missing_tickers, index=initial_data.index
    )

    # Construct full initial dataframe with NaNs from the missing tickers
    full_initial = pd.concat([initial_data, missing_df], axis=1)
    full_initial = full_initial.reindex(sorted(full_initial.columns), axis=1)

    # For each attribute, set the first day as the clean initial data for all
    # tickers, forward fill, and then reindex using the original index
    for attr in attributes:
        idx = fund_data[attr].index
        cols = fund_data[attr].columns
        fund_data[attr].loc[start_date] = full_initial.loc[attr, cols]
        fund_data[attr].sort_index(inplace=True)
        fund_data[attr].fillna(method='ffill', inplace=True)
        fund_data[attr] = fund_data[attr].loc[idx]

    if dates is not None:
        fund_data = resample_index_dict(fund_data, dates, True)
    return fund_data
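
The forward-filling step above is standard pandas `ffill`; a toy example of what it does to sparse fundamental data on a daily index (ticker and values invented):

```python
import pandas as pd

# Quarterly-style data with gaps on a daily index
df = pd.DataFrame({'AAA': [1.0, None, None, 2.0, None]},
                  index=pd.date_range('2019-04-01', periods=5))

filled = df.ffill()  # equivalent to the fillna(method='ffill') call above
print(filled['AAA'].tolist())  # [1.0, 1.0, 1.0, 2.0, 2.0]
```

Each NaN inherits the most recent known value, which is exactly the "most recent filing applies until superseded" semantics we want for fundamentals.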
In [11]:
def get_equity_data(
    tickers: Optional[Union[str, array_1d]] = None,
    start_date: Optional[date_obj] = None,
    end_date: Optional[date_obj] = None,
    attributes: Optional[array_1d] = None,
    raw: bool = True,
) -> Dict[str, pd.DataFrame]:
    """Gets data from equity table and returns a dictionary of DataFrame
    keyed by attributes. Each of these DataFrame has tickers as columns and
    rows as dates and values of the attribute for that day

    If the ticker is a string, only data for that ticker will be queried
    If the ticker is a array1d of strings, data for the tickers will be queried
    If no arguments are specified, data from the entire equity table is returned

    Args:
        tickers: Gets data for only this ticker or list of tickers
        attributes: Gets data for columns corresponding to list of attributes
        start_date: Gets data from after this date (inclusive)
        end_date: Gets data from before this date (inclusive)
        raw: If true, data pulled from equity_raw table
             Otherwise data pulled from equity table
    Returns:
        A Dictionary of DataFrames keyed by attributes.
    """

    table_name = 'equity_raw' if raw else 'equity'

    # Query database for rows
    response_df = get_data(
        table_name, tickers, start_date, end_date, attributes
    )
    # Check if response_df is None and throw error
    if response_df is None:
        raise TypeError("Response from query was None")

    # Convert to datetime
    response_df['date'] = pd.to_datetime(response_df['date'])

    # Sort on date
    response_df.sort_values(['date'], inplace=True)

    # Set date as index
    response_df.set_index('date', inplace=True)

    # Get attributes that are not date and ticker columns
    non_date_ticker_attr = response_df.drop(['ticker'], axis=1).columns

    return create_attribute_dictionary(response_df, non_date_ticker_attr)
In [12]:
def create_attribute_dictionary(
    df: pd.DataFrame, attributes: array_1d
) -> Dict[str, pd.DataFrame]:
    """Takes Dataframe with attributes as columns and returns a dictionary of
    DataFrames. The DataFrames in the dictionary have specified index (e.g.
    date) and specified columns (e.g. tickers).

    Args:
        df: Pandas DataFrame with attributes in the the columns
        attributes: list of attributes for keys in the dictionary,
                    these must be columns in the DataFrame
    Returns:
        Dictionary of DataFrames keyed by attribute whose index and
        columns were given as parameters and values are the attribute value.
    """
    df_dict = {}
    df = df.drop_duplicates()  # drop duplicate rows occasionally returned by the query

    # split up by attribute, shape it so tickers are columns
    for attr in attributes:
        attr_df = df[['ticker', attr]].pivot(columns='ticker', values=attr)
        df_dict[attr] = attr_df.reindex(
            sorted(attr_df.columns), axis=1
        )  # sort by columns by ticker symbol

    return df_dict
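
A toy example makes the pivot inside `create_attribute_dictionary` concrete (tickers and prices invented):

```python
import pandas as pd

# Long-format frame: one row per (date, ticker) observation
df = pd.DataFrame({
    'ticker': ['AAA', 'BBB', 'AAA', 'BBB'],
    'close':  [10.0, 20.0, 11.0, 21.0],
}, index=pd.to_datetime(['2019-04-01', '2019-04-01',
                         '2019-04-02', '2019-04-02']))

# Pivot so tickers become columns and the date index carries the rows
wide = df[['ticker', 'close']].pivot(columns='ticker', values='close')
print(wide)
#              AAA   BBB
# 2019-04-01  10.0  20.0
# 2019-04-02  11.0  21.0
```

Each attribute (close, ebitda, eps, ...) gets its own wide frame like this, which is the shape the rest of the analysis expects.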

From the database, extract the following Fundamental Factors for each stock in the S&P500:

Asset Turnover Ratio ('assetturnover'): This indicator measures the value of a company's sales relative to its assets. A higher asset turnover ratio indicates a company that is efficient at generating revenue from assets.

$\text{Asset Turnover Ratio} = \frac{\text{Sales}}{\frac{\text{Beginning Assets} + \text{Ending Assets}}{2}}$

Current Ratio ('currentratio'): This indicator measures the ability of a company to pay off short-term obligations.

$\text{Current Ratio} = \frac{\text{Current Assets}}{\text{Current Liabilities}}$

Earnings Before Interest, Taxes, Depreciation, and Amortization ('ebitda'): This indicator measures the overall financial performance, or profitability, of a company. It is often considered a cleaner indicator because it represents earnings before accounting and financing deductions.

$\text{EBITDA} = \text{Net Income} + \text{Interest} + \text{Taxes} + \text{Depreciation} + \text{Amortization}$

Earnings per Share ('eps'): This indicator measures how much money a company makes for each share of its stock, an indication of corporate value. The larger the Earnings per Share, the more profitable the company is.

$\text{Earnings per Share} = \frac{\text{Net Income} - \text{Preferred Dividends}}{\text{End-of-Period Common Shares Outstanding}}$

Enterprise Value ('ev'): This indicator measures the enterprise value of a company, including market capitalization adjusted for short-term and long-term debts as well as any cash on the balance sheet. This is often viewed as the "selling price" that a company would go for.

$\text{Enterprise Value} = \text{Market Capitalization} + \text{Total Debt} - \text{Cash}$

Gross Profit Margin ('grossmargin'): This indicator measures the portion of each dollar of revenue that a company retains as gross profit.

$\text{Gross Profit Margin} = \frac{\text{Sales} - \text{Cost of Goods Sold}}{\text{Sales}}$

Price to Earnings Ratio ('pe'): This indicator shows the current share price relative to the Earnings Per Share of a company. Investors use this ratio to determine the relative value of a company's shares.

$\text{P/E} = \frac{\text{Market Value per Share}}{\text{Earnings per Share}}$

Return on Equity ('roe'): This indicator measures a company's profitability relative to its shareholders' equity. As a shortcut, investors can consider an ROE near the long-term average of the S&P 500 (14%) as acceptable and anything less than 10% as poor.

$\text{Return on Equity} = \frac{\text{Net Income}}{\text{Average Shareholders' Equity}}$

Return on Invested Capital ('roic'): This indicator measures the efficiency of a company at allocating capital towards profitable investments. A company is creating value if its ROIC exceeds 2% and destroying value if less than 2%.

$\text{ROIC} = \frac{\text{Net Operating Profit after Tax}}{\text{Invested Capital}}$

Working Capital ('workingcapital'): This indicator measures the liquidity of a company. Companies with substantial positive working capital should have the potential to invest and grow. If a company's current assets do not exceed its current liabilities, then it may have trouble growing or paying back creditors, or even go bankrupt.

$\text{Working Capital} = \text{Current Assets} - \text{Current Liabilities}$

Research and Development ('rnd'): This indicator measures a company's spending on innovation and the development of new products or services.
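
As a sanity check on the definitions above, several of these ratios can be computed by hand for a hypothetical company (all figures invented for illustration):

```python
# Hypothetical annual figures, in $ millions (shares in millions)
sales, cogs = 500.0, 300.0
current_assets, current_liabilities = 220.0, 110.0
net_income, shares_outstanding = 60.0, 50.0

current_ratio = current_assets / current_liabilities    # 2.0
working_capital = current_assets - current_liabilities  # 110.0 ($M)
gross_margin = (sales - cogs) / sales                   # 0.4
eps = net_income / shares_outstanding                   # $1.20 per share

print(current_ratio, working_capital, gross_margin, eps)
```

This toy company can cover its short-term liabilities twice over and keeps 40 cents of every sales dollar as gross profit.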

In [13]:
#Glossary of fundamental attributes available in the database
fundamental_attributes = {
    'assets': 'assets',
    'assetturnover': 'asset turnover ratio',
    'bvps': 'book value of equity per share',
    'capex': 'capital expenditures',
    'currentratio': 'current ratio',
    'debt': 'debt',
    'divyield': 'dividend yield',
    'dps': 'dividends per share',
    'ebit': 'earnings before interest and taxes',
    'ebitda': 'earnings before interest, taxes, depreciation, and amortization',
    'ebitdamargin': 'ebitda as a percentage of total revenue',
    'eps': 'earnings per share',
    'equity': 'equity',
    'ev': 'enterprise value',
    'fcf': 'free cash flow',
    'grossmargin': 'gross profit margin',
    'inventory': 'inventory',
    'investments': 'investments',
    'liabilities': 'liabilities',
    'marketcap': 'market capitalization',
    'netinc': 'net income',
    'opex': 'operating expenses',
    'opinc': 'operating income',
    'payables': 'accounts payable',
    'payoutratio': 'payout ratio',
    'pb': 'price to book ratio',
    'pe': 'price to earnings ratio',
    'receivables': 'receivables',
    'revenue': 'revenue',
    'rnd': 'research and development',
    'roa': 'return on assets',
    'roe': 'return on equity',
    'roic': 'return on invested capital',
    'ros': 'return on sales',
    'workingcapital': 'working capital',
}

Also extract Equity Data for each company in the S&P500

Opening Price ('open'): The opening price of a stock is its price at the opening of trading on the New York Stock Exchange at 9:30 a.m. ET.

High Price ('high'): The high price for a stock is the peak intraday price of a stock.

Low Price ('low'): The low price for a stock is the lowest intraday price of a stock.

Adjusted Closing Price ('close'): The adjusted closing price is the closing price of a stock adjusted for stock splits, dividends, and rights offerings.

Volume ('volume'): The number of shares that change hands over the course of a day for a given company.

Dividends ('dividends'): The distribution of a portion of a company's earnings to its shareholders.

Unadjusted Closing Price ('closeunadj'): The unadjusted closing price of a stock is its price at the close of trading on the New York Stock Exchange at 4:00 p.m. ET, without adjustments for splits or dividends.

In [13]:
def extract_data(stocks, start_date, end_date):
    """Queries the PostgreSQL database for Equity and Financial data 
    for the given list of stocks between the start date and the end date

    Args:
        stocks: A list of tickers
        start_date: A datetime object representing the start date
        end_date: A datetime object representing the end date

    Returns:
        Two dataframes, the first one being equity data and the second
        being the fundamental data
    """
    
    fundamental = get_fundamental_filled(stocks, start_date, end_date, 
                                         ['ebitda',
                                          'eps',
                                          'ev',
                                          'grossmargin',
                                          'pe',
                                          'rnd',
                                          'marketcap'])
    
    equity_data = get_equity_data(stocks, start_date, end_date)
    
    return equity_data, fundamental

We split the time series data into a separate DataFrame for each variable of interest.

In [14]:
#Extract Equity Data and Fundamental Data for each ticker in the S&P500
equity_data, fundamental_data = extract_data(tickers, start_date, end_date)

#Create Dataframes for each variable of interest where columns correspond to the ticker and rows correspond to the date
#EQUITY
close_data = equity_data['close']
#FUNDAMENTAL
ebitda_data = fundamental_data['ebitda']
eps_data = fundamental_data['eps']
ev_data = fundamental_data['ev']
gm_data = fundamental_data['grossmargin']
pe_data = fundamental_data['pe']
rnd_data = fundamental_data['rnd']
mc_data = fundamental_data['marketcap']

for df in (ebitda_data, eps_data, ev_data, gm_data,
           pe_data, rnd_data, mc_data):
    df.index.names = ['date']

Unfortunately, we don't have fundamental data available for every stock in the S&P500

In [15]:
print(f' Number of stocks in close_data: {close_data.shape[1]}')
print(f' Number of stocks in ev_data: {ev_data.shape[1]}')
print(f' Number of stocks in stock_info: {stock_info.shape[0]}')
 Number of stocks in close_data: 501
 Number of stocks in ev_data: 495
 Number of stocks in stock_info: 505

To account for this, we prune the equity data down to the stocks for which fundamental data is available

In [16]:
# Keep only tickers in the S&P500 list that are present in the fundamental data
stock_info = stock_info[stock_info['Ticker'].isin(ev_data.columns)]
stock_info.reset_index(inplace=True, drop=True)

# Keep only close price columns that are present in the fundamental data
close_data = close_data[[x for x in close_data.columns if x in ev_data.columns]]
In [17]:
print(f' Number of stocks in close_data: {close_data.shape[1]}')
print(f' Number of stocks in ev_data: {ev_data.shape[1]}')
print(f' Number of stocks in stock_info: {stock_info.shape[0]}')
 Number of stocks in close_data: 495
 Number of stocks in ev_data: 495
 Number of stocks in stock_info: 495

Next, we check for the presence of missing data in the datasets

In [18]:
print(f'NaN data in stock_info: {stock_info.isna().sum().sum()}')
print(f'NaN data in close_data: {close_data.isna().sum().sum()}')
print(f'NaN data in ebitda_data: {ebitda_data.isna().sum().sum()}')
print(f'NaN data in gm_data: {gm_data.isna().sum().sum()}')
print(f'NaN data in pe_data: {pe_data.isna().sum().sum()}')
print(f'NaN data in eps_data: {eps_data.isna().sum().sum()}')
print(f'NaN data in ev_data: {ev_data.isna().sum().sum()}')
print(f'NaN data in rnd_data: {rnd_data.isna().sum().sum()}')
print(f'NaN data in mc_data: {mc_data.isna().sum().sum()}')
NaN data in stock_info: 0
NaN data in close_data: 93
NaN data in ebitda_data: 271
NaN data in gm_data: 223
NaN data in pe_data: 367
NaN data in eps_data: 313
NaN data in ev_data: 183
NaN data in rnd_data: 223
NaN data in mc_data: 183

Since there are missing data, we forward fill NaN values, then backward fill entries that begin with NaNs. Dropping rows that include NaNs would greatly diminish the size of the datasets and lead to information loss. We forward fill first because fundamental factors remain relatively constant on a day-to-day time-scale; companies typically report them on a quarterly basis, or 4 times per year$^{[11]}$.

In [19]:
# Forward fill, then backward fill, each dataframe of interest
for df in (close_data, ebitda_data, eps_data, ev_data,
           gm_data, pe_data, rnd_data, mc_data):
    df.fillna(method='ffill', inplace=True)
    df.fillna(method='bfill', inplace=True)

# Drop any columns of pe_data that remain entirely NaN
pe_data.dropna(axis=1, inplace=True)
In [20]:
print(f'NaN data in stock_info: {stock_info.isna().sum().sum()}')
print(f'NaN data in close_data: {close_data.isna().sum().sum()}')
print(f'NaN data in ebitda_data: {ebitda_data.isna().sum().sum()}')
print(f'NaN data in gm_data: {gm_data.isna().sum().sum()}')
print(f'NaN data in pe_data: {pe_data.isna().sum().sum()}')
print(f'NaN data in eps_data: {eps_data.isna().sum().sum()}')
print(f'NaN data in ev_data: {ev_data.isna().sum().sum()}')
print(f'NaN data in rnd_data: {rnd_data.isna().sum().sum()}')
print(f'NaN data in mc_data: {mc_data.isna().sum().sum()}')
NaN data in stock_info: 0
NaN data in close_data: 0
NaN data in ebitda_data: 0
NaN data in gm_data: 0
NaN data in pe_data: 0
NaN data in eps_data: 0
NaN data in ev_data: 0
NaN data in rnd_data: 0
NaN data in mc_data: 0

When we check the closing price data, we can see that we still have data missing for some days. This is because stock exchanges, like the New York Stock Exchange, follow a regular work week, where markets are closed on weekends and holidays$^{[12]}$. This makes our time-series data non-continuous.

In [21]:
close_data.head(10)
Out[21]:
ticker A AAL AAP AAPL ABBV ABC ABMD ABT ACN ADBE ... XLNX XOM XRAY XRX XYL YUM ZBH ZBRA ZION ZTS
date
2019-04-01 81.56 32.35 173.63 47.810 80.78 79.09 277.76 79.66 176.32 272.17 ... 129.73 81.73 49.84 33.01 80.56 100.58 127.76 211.91 46.72 101.54
2019-04-02 81.14 32.99 173.34 48.505 83.07 74.49 284.20 79.62 175.37 271.35 ... 129.09 81.38 50.15 32.95 80.04 100.18 127.50 213.62 46.93 102.04
2019-04-03 81.94 33.71 171.69 48.837 83.08 75.05 284.94 79.50 177.19 271.50 ... 128.57 80.90 50.35 33.09 79.69 100.54 126.48 215.04 47.18 102.12
2019-04-04 80.83 33.93 174.00 48.922 82.81 75.97 282.87 78.62 177.16 267.89 ... 127.82 82.05 50.26 33.05 80.00 100.45 126.56 214.01 47.68 101.98
2019-04-05 81.47 34.06 176.78 49.250 83.45 77.27 284.91 79.00 178.15 267.45 ... 129.08 82.49 50.40 33.38 80.34 99.96 127.21 218.33 47.57 102.12
2019-04-08 81.69 33.88 177.59 50.025 83.98 78.12 282.95 78.52 178.91 268.81 ... 129.90 83.00 50.33 33.75 81.33 99.59 128.76 221.44 47.62 102.23
2019-04-09 81.42 33.31 175.65 49.875 82.69 77.43 281.65 78.67 177.73 268.99 ... 128.93 81.93 50.42 33.24 79.57 99.48 128.66 220.12 46.90 101.46
2019-04-10 81.68 34.02 178.80 50.155 82.95 74.00 280.54 78.97 177.00 271.58 ... 131.59 81.56 50.61 33.84 79.37 100.04 128.84 225.61 47.14 101.33
2019-04-11 81.08 34.81 180.96 49.737 81.77 74.18 271.89 78.51 178.36 271.90 ... 132.07 81.95 50.49 34.25 80.90 100.77 128.54 226.70 47.50 101.06
2019-04-12 80.98 34.69 180.11 49.718 80.78 74.74 270.08 78.01 178.64 271.86 ... 134.41 80.92 51.24 34.39 82.37 101.40 127.92 233.74 48.42 101.67

10 rows × 495 columns

To account for this missing aspect of the data, we further forward fill the data for missing dates. This is the most logical approach: when stocks are not traded, the last traded price remains the most recent known price (even though perceptions of the stock's value continue to change). We expect that at Monday's market open, after the weekend break, stock prices change more drastically on average from their Friday closing price than close/open prices do throughout the work week.

In [22]:
close_data = close_data.resample('D').ffill().reset_index()
close_data.set_index('date', inplace=True)

ebitda_data = ebitda_data.resample('D').ffill()
eps_data = eps_data.resample('D').ffill()
ev_data = ev_data.resample('D').ffill()
gm_data = gm_data.resample('D').ffill()
pe_data = pe_data.resample('D').ffill()
rnd_data = rnd_data.resample('D').ffill()
mc_data = mc_data.resample('D').ffill()
In [23]:
close_data.head(10)
Out[23]:
ticker A AAL AAP AAPL ABBV ABC ABMD ABT ACN ADBE ... XLNX XOM XRAY XRX XYL YUM ZBH ZBRA ZION ZTS
date
2019-04-01 81.56 32.35 173.63 47.810 80.78 79.09 277.76 79.66 176.32 272.17 ... 129.73 81.73 49.84 33.01 80.56 100.58 127.76 211.91 46.72 101.54
2019-04-02 81.14 32.99 173.34 48.505 83.07 74.49 284.20 79.62 175.37 271.35 ... 129.09 81.38 50.15 32.95 80.04 100.18 127.50 213.62 46.93 102.04
2019-04-03 81.94 33.71 171.69 48.837 83.08 75.05 284.94 79.50 177.19 271.50 ... 128.57 80.90 50.35 33.09 79.69 100.54 126.48 215.04 47.18 102.12
2019-04-04 80.83 33.93 174.00 48.922 82.81 75.97 282.87 78.62 177.16 267.89 ... 127.82 82.05 50.26 33.05 80.00 100.45 126.56 214.01 47.68 101.98
2019-04-05 81.47 34.06 176.78 49.250 83.45 77.27 284.91 79.00 178.15 267.45 ... 129.08 82.49 50.40 33.38 80.34 99.96 127.21 218.33 47.57 102.12
2019-04-06 81.47 34.06 176.78 49.250 83.45 77.27 284.91 79.00 178.15 267.45 ... 129.08 82.49 50.40 33.38 80.34 99.96 127.21 218.33 47.57 102.12
2019-04-07 81.47 34.06 176.78 49.250 83.45 77.27 284.91 79.00 178.15 267.45 ... 129.08 82.49 50.40 33.38 80.34 99.96 127.21 218.33 47.57 102.12
2019-04-08 81.69 33.88 177.59 50.025 83.98 78.12 282.95 78.52 178.91 268.81 ... 129.90 83.00 50.33 33.75 81.33 99.59 128.76 221.44 47.62 102.23
2019-04-09 81.42 33.31 175.65 49.875 82.69 77.43 281.65 78.67 177.73 268.99 ... 128.93 81.93 50.42 33.24 79.57 99.48 128.66 220.12 46.90 101.46
2019-04-10 81.68 34.02 178.80 50.155 82.95 74.00 280.54 78.97 177.00 271.58 ... 131.59 81.56 50.61 33.84 79.37 100.04 128.84 225.61 47.14 101.33

10 rows × 495 columns

Next, we retrieve the closing price of the S&P500 Index as a whole (via the SPY ETF, using the Alpha Vantage API)

In [24]:
def get_sp_close(start_date: date_obj, end_date: date_obj) -> pd.Series:
    """temporary function to get S&P500 closing prices

    Args:
        start_date: Beginning of date range for data
        end_date: End of date range for data
    Returns:
        Series of close prices
    """
    
    api_key = '24XHLRJWJKXJBSSA'
    base_url = (
        'https://www.alphavantage.co/query?function='
        'TIME_SERIES_DAILY_ADJUSTED'
    )
    functions = '&symbol=SPY&apikey={}&datatype=csv&outputsize=full'.format(
        api_key
    )
    full_url = base_url + functions
    sp500_data = pd.read_csv(full_url)[
        ['timestamp', 'adjusted_close', 'dividend_amount']
    ]
    sp500_data.rename(
        columns={
            'timestamp': 'date',
            'adjusted_close': 'close',
            'dividend_amount': 'dividend',
        },
        inplace=True,
    )
    sp500_data['date'] = pd.to_datetime(sp500_data['date'])
    sp500_data.sort_values('date', inplace=True)
    sp500_data = sp500_data[
        (sp500_data['date'] >= start_date) & (sp500_data['date'] <= end_date)
    ]
    sp500_data.index = sp500_data['date']
    return sp500_data['close']
In [25]:
# Get the S&P500 closing price
sp_data = get_sp_close(start_date, end_date)
sp_data = sp_data.resample('D').ffill().reset_index()
sp_data.set_index('date', inplace=True)

So how well has the S&P500 performed in this time period?

In [26]:
fig, ax = plt.subplots(figsize=(10,10))
ax.plot(sp_data.index, sp_data, label = 'Daily SPY')
ax.plot(sp_data.rolling(10).mean(), label = '10 Day Rolling Mean')
ax.plot(sp_data.rolling(30).mean(), label = '30 Day Rolling Mean')
ax.plot(sp_data.rolling(100).mean(), label = '100 Day Rolling Mean')
ax.set_title('S&P500 Closing Price')
ax.set_xlabel('Date')
ax.set_ylabel('Closing price (in USD)')
ax.legend()
fig.show()

To calculate the returns of the S&P500 during this period we use the following formula:

$ Returns = \frac{Final Price - Initial Price}{Initial Price} $

In [58]:
returns = (sp_data['close'][-1] - sp_data['close'][0]) / sp_data['close'][0] * 100
print('S&P500 returns between April 1, 2019 and December 1, 2019: ' + str(returns) + '%')
S&P500 returns between April 1, 2019 and December 1, 2019: 11.01199083786702%

For most days during the period, the S&P500 had close to 0% daily returns. The ~11% return calculated above is largely due to one month (November to December) of high average returns near the end of the period.

In [156]:
fig, ax = plt.subplots(figsize=(8,8))
ax.plot(sp_data.index, stats.zscore(sp_data), label = 'SPY')
ax.set_title("Z-scored Closing Price for S&P500")
ax.set_ylabel("Standard Deviations away from mean closing price")
ax.set_xlabel("Date")
plt.show()

From the above figure, the S&P500 had closing prices roughly 1 to 2 standard deviations below the mean at the beginning of the period and roughly 1 to 2 standard deviations above the mean at the end of the period.

In [157]:
sp_data.pct_change().plot.hist(bins = 60)
plt.xlabel("Daily returns %")
plt.ylabel("Frequency")
plt.title("S&P500 daily returns data")
plt.text(-0.03,100,"Extremely Low\nreturns")
plt.text(0.02,100,"Extremely High\nreturns")
plt.show()

The histogram above shows that most days saw little to no returns and that the S&P500 had more good days than bad days between April and December of 2019.

Which stocks in the S&P500 have the highest EBITDA?

In [164]:
top_20_ebitda = ebitda_data.mean().sort_values(ascending=False)[0:20]

fig, ax = plt.subplots(figsize=(15,8))
ax.bar(top_20_ebitda.index,top_20_ebitda.to_numpy())
fig.show()

Berkshire Hathaway, Apple, Microsoft, AT&T, and Google have the highest EBITDA. This indicates that these companies have the highest overall financial performance/profitability.

Which stocks in the S&P500 have the highest Earnings per Share?

In [163]:
top_20_eps = eps_data.mean().sort_values(ascending=False)[0:20]

fig, ax = plt.subplots(figsize=(15,8))
ax.bar(top_20_eps.index,top_20_eps.to_numpy())
fig.show()

NVR (a home construction company), Booking Holdings (owner of Kayak, Priceline, and Booking.com), AutoZone, and Bio-Rad Laboratories have the highest Earnings per Share. This indicates that these companies are the most profitable (make the most money per share of stock).

Which companies in the S&P500 have the highest Price to Earnings Ratio?

In [159]:
top_20_pe = pe_data.mean().sort_values(ascending=False)[0:20]

fig, ax = plt.subplots(figsize=(15,8))
ax.bar(top_20_pe.index,top_20_pe.to_numpy())
fig.show()

Weyerhaeuser (Real Estate Investment Company), ServiceNow (Software Company), Cardinal Health, and Live Nation Entertainment have the highest Price to Earnings Ratio. These companies either have the highest potential for growth or are the most overvalued.

Which companies in the S&P 500 have the highest Research and Development?

In [158]:
top_20_rnd = rnd_data.mean().sort_values(ascending=False)[0:20]

fig, ax = plt.subplots(figsize=(15,8))
ax.bar(top_20_rnd.index,top_20_rnd.to_numpy())
fig.show()

Amazon, Google, Microsoft, Apple, Intel, and Facebook are the companies with the highest spending towards research and development. All of these companies are in the technology sector. Following these companies are the largest Health Care companies, such as Johnson & Johnson, Merck, AbbVie, and Pfizer.

Which companies in the S&P have the highest Enterprise Value?

In [119]:
top_20_ev = ev_data.mean().sort_values(ascending=False)[0:20]

fig, ax = plt.subplots(figsize=(15,8))
ax.bar(top_20_ev.index,top_20_ev.to_numpy())
fig.show()

Apple, Amazon, Microsoft, Google, JPMorgan Chase, Berkshire Hathaway, and Facebook all have the largest Enterprise Value. These companies also happen to be in the top 10 companies in the S&P500 by market cap.

So What Sectors Make up the S&P500?

In [83]:
frequencies = stock_info.groupby('Sector').size()
        
fig1, ax1 = plt.subplots(figsize=(10,10))
ax1.pie(frequencies, labels=frequencies.index, autopct='%1.1f%%', startangle=90, )
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.

plt.show()

From the above pie chart, we can see the top 2 largest sectors represented in the S&P500: 14.5% of companies in the S&P500 are Technology companies and 14.3% are Industrial companies

The two lowest represented sectors in the S&P500 are Communication Services at 4.4% and Energy at 5.1%

Overall, the S&P500 is heavily biased towards Technology, Financial Services, Industrials, and Health Care companies.

In [127]:
#Calculate the total market cap of the top 10 companies in the S&P500
top_10_mc = mc_data.iloc[-1,].sort_values(ascending=False)[0:10].sum()

#Calculate total market cap of the S&P500
all_mc = mc_data.iloc[-1,].sort_values(ascending=False).sum()

print("\nPercent of Total S&P500 Market Capitalization Represented by top 10 stocks:")
print(top_10_mc/all_mc * 100)
Percent of Total S&P500 Market Capitalization Represented by top 10 stocks:
24.005589113666794
In [126]:
top_20_mc = mc_data.iloc[-1].sort_values(ascending=False)[0:20]

fig, ax = plt.subplots(figsize=(15,8))
ax.bar(top_20_mc.index,top_20_mc.to_numpy())
ax.set_title("S&P500 Stocks with Largest Market Capitalization")
ax.set_ylabel('Market Capitalization')
ax.set_xlabel('Tickers')
fig.show()

In fact, the top 10 largest companies in the S&P500, representing about 24% of the index by market capitalization, are Apple Inc., Microsoft, Amazon.com, Google, Facebook, Berkshire Hathaway, JPMorgan Chase & Co., Visa Inc., Johnson & Johnson, and Walmart (5 Technology companies, 3 Financial companies, 1 Health Care company, and 1 Consumer Staples company).

Conclusion: So maybe the S&P500 isn't actually diverse enough?

What is the Average Market Capitalization by Sector in the S&P500?

In [217]:
average_mc = mc_data.mean().sort_values(ascending=False)

frequencies = stock_info.groupby('Sector').size()
mc_sectors = frequencies.copy()
for i, row in mc_sectors.items():
    mc_sectors[i] = 0

for i, row in stock_info.iterrows():
    for j, row2 in average_mc.items():
        if(stock_info.iloc[i]['Ticker'] == j):
            
            for k, row3 in mc_sectors.items():
                if (k == stock_info.iloc[i]['Sector']):
                    mc_sectors[k] += row2
In [218]:
for i, val in mc_sectors.items():
    mc_sectors[i] /= frequencies[i]
mc_sectors
Out[218]:
Sector
Communication Services    129098021623
Consumer Discretionary     46049478560
Consumer Staples           64781205247
Energy                     46974507132
Financials                 52748772185
Health Care                54525666019
Industrials                33620188040
Information Technology     76429328280
Materials                  24895016205
Real Estate                24148276560
Utilities                  28625746227
dtype: int64
In [219]:
fig1, ax1 = plt.subplots(figsize=(10,10))
ax1.pie(mc_sectors, labels=mc_sectors.index, autopct='%1.1f%%', startangle=90, )
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.

plt.show()

Although a handful of Technology, Health Care, and Financial companies at the top of the S&P500 represent around 24% of its total market capitalization, the graph above shows that the remaining companies in these sectors have considerably lower market capitalizations, so on a per-company basis these sectors represent comparatively less of the total. Communication Services has the highest average market capitalization, at 22.2% of the sector averages shown, yet none of its companies rank among the very largest, suggesting that market capitalization is distributed relatively evenly across that sector. The next largest sectors by average market capitalization are Information Technology and Consumer Staples.
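As an aside, the nested loops used above to average market capitalization by sector can be written more idiomatically with a pandas groupby. A minimal sketch, using small hypothetical stand-ins for `stock_info` and the per-ticker averages (the tickers and values here are illustrative, not from the dataset):

```python
import pandas as pd

# Hypothetical stand-ins for stock_info and the per-ticker averages
stock_info = pd.DataFrame({
    'Ticker': ['AAPL', 'MSFT', 'JPM', 'XOM'],
    'Sector': ['Information Technology', 'Information Technology',
               'Financials', 'Energy'],
})
average_mc = pd.Series({'AAPL': 1.2e12, 'MSFT': 1.1e12,
                        'JPM': 4.0e11, 'XOM': 3.0e11})

# Map each ticker to its sector, then take the mean within each sector
sector_map = stock_info.set_index('Ticker')['Sector']
mc_sectors = average_mc.groupby(sector_map).mean()
print(mc_sectors)
```

Grouping by an aligned ticker-to-sector mapping replaces the three nested loops with a single vectorized operation and the same per-sector averages.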

What is the Average Enterprise Value (EV) by Sector in the S&P500?

In [214]:
average_ev = ev_data.mean().sort_values(ascending=False)

frequencies = stock_info.groupby('Sector').size()
ev_sectors = frequencies.copy()
for i, row in ev_sectors.items():
    ev_sectors[i] = 0

for i, row in stock_info.iterrows():
    for j, row2 in average_ev.items():
        if(stock_info.iloc[i]['Ticker'] == j):
            
            for k, row3 in ev_sectors.items():
                if (k == stock_info.iloc[i]['Sector']):
                    ev_sectors[k] += row2
In [215]:
for i, val in ev_sectors.items():
    ev_sectors[i] /= frequencies[i]
In [216]:
fig1, ax1 = plt.subplots(figsize=(10,10))
ax1.pie(ev_sectors, labels=ev_sectors.index, autopct='%1.1f%%', startangle=90, )
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.

plt.show()

The results for enterprise value are very similar to those for market capitalization. This is because both values measure a company's "worth", except that enterprise value also accounts for debt and cash. As such, there are a few notable differences between the two. Firstly, the Financial sector's share grew by over 1.7% when measured by enterprise value rather than market capitalization. Since enterprise value adds net debt to market capitalization, this suggests that Financial companies carry relatively more debt than their market capitalization reflects. Conversely, the shares of Information Technology and Health Care shrink under enterprise value, consistent with companies in these sectors holding relatively large cash reserves.

What is the average EBITDA by Sector in the S&P500?

In [186]:
average_ebitda = ebitda_data.mean().sort_values(ascending=False)

frequencies = stock_info.groupby('Sector').size()
ebitda_sectors = frequencies.copy()
for i, row in ebitda_sectors.items():
    ebitda_sectors[i] = 0

for i, row in stock_info.iterrows():
    for j, row2 in average_ebitda.items():
        if(stock_info.iloc[i]['Ticker'] == j):
            
            for k, row3 in ebitda_sectors.items():
                if (k == stock_info.iloc[i]['Sector']):
                    ebitda_sectors[k] += row2
In [187]:
for i, val in ebitda_sectors.items():
    ebitda_sectors[i] /= frequencies[i]
ebitda_sectors
Out[187]:
Sector
Communication Services    3340560419
Consumer Discretionary     904265295
Consumer Staples          1105396679
Energy                    1605138364
Financials                1726352277
Health Care               1110906659
Industrials                817797856
Information Technology    1312425160
Materials                  617541002
Real Estate                445529439
Utilities                 1054350409
dtype: int64
In [188]:
fig1, ax1 = plt.subplots(figsize=(10,10))
ax1.pie(ebitda_sectors, labels=ebitda_sectors.index, autopct='%1.1f%%', startangle=90, )
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.

plt.show()

From this graph we can see that Energy, Financial, and Communication Services companies have much larger earnings before interest, taxes, depreciation, and amortization, with Communication Services representing a significantly larger portion of earnings in the S&P500.

What is the Average Price to Earnings Ratio in the S&P by Sector?

In [192]:
average_pe = pe_data.mean().sort_values(ascending=False)

frequencies = stock_info.groupby('Sector').size()
pe_sectors = frequencies.copy()
for i, row in pe_sectors.items():
    pe_sectors[i] = 0

for i, row in stock_info.iterrows():
    for j, row2 in average_pe.items():
        if(stock_info.iloc[i]['Ticker'] == j):
            
            for k, row3 in pe_sectors.items():
                if (k == stock_info.iloc[i]['Sector']):
                    pe_sectors[k] += row2
In [193]:
for i, val in pe_sectors.items():
    pe_sectors[i] /= frequencies[i]
pe_sectors
Out[193]:
Sector
Communication Services     71
Consumer Discretionary     32
Consumer Staples           22
Energy                      6
Financials                  0
Health Care                51
Industrials                20
Information Technology     98
Materials                  16
Real Estate               289
Utilities                  24
dtype: int64
In [194]:
fig1, ax1 = plt.subplots(figsize=(10,10))
ax1.pie(pe_sectors, labels=pe_sectors.index, autopct='%1.1f%%', startangle=90, )
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.

plt.show()

A high price to earnings ratio often indicates companies that are either overvalued or have the highest potential for growth. Real Estate companies represent 45.9% of the sector-average price to earnings of the S&P500. Real Estate is likely primed for growth, and possibly overvalued, as the industry has been steadily recovering since the 2008 Housing Crisis$^{[13]}$. Information Technology also likely has a high price to earnings ratio since it has seen the largest growth in recent years and is expected to grow further in the coming years.

What is the Average Earnings per Share in the S&P500 by Sector?

In [249]:
average_eps = eps_data.mean().sort_values(ascending=False)

frequencies = stock_info.groupby('Sector').size()
eps_sectors = frequencies.copy().astype('float')
for i, row in eps_sectors.items():
    eps_sectors[i] = 0

for i, row in stock_info.iterrows():
    for j, row2 in average_eps.items():
        if(stock_info.iloc[i]['Ticker'] == j):
            
            for k, row3 in eps_sectors.items():
                if (k == stock_info.iloc[i]['Sector']):
                    eps_sectors[k] += row2
In [250]:
for i, val in eps_sectors.items():
    eps_sectors[i] = float(eps_sectors[i]/frequencies[i])
In [251]:
fig1, ax1 = plt.subplots(figsize=(10,10))
ax1.pie(eps_sectors, labels=eps_sectors.index, autopct='%1.1f%%', startangle=90, )
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()

Overall, Consumer Discretionary companies represent the highest percentage of Earnings per Share, followed by Financial companies. This is likely because companies in these sectors have the fewest stock splits on average, meaning that they rarely increase the supply of their stock.

What is the Average Research and Development Spending by Sector in the S&P500?

In [198]:
average_rnd = rnd_data.mean().sort_values(ascending=False)

frequencies = stock_info.groupby('Sector').size()
rnd_sectors = frequencies.copy()
for i, row in rnd_sectors.items():
    rnd_sectors[i] = 0

for i, row in stock_info.iterrows():
    for j, row2 in average_rnd.items():
        if(stock_info.iloc[i]['Ticker'] == j):
            
            for k, row3 in rnd_sectors.items():
                if (k == stock_info.iloc[i]['Sector']):
                    rnd_sectors[k] += row2
In [199]:
for i, val in rnd_sectors.items():
    rnd_sectors[i] /= frequencies[i]
rnd_sectors
Out[199]:
Sector
Communication Services    477795431
Consumer Discretionary    158110688
Consumer Staples            1794346
Energy                     44666720
Financials                   361616
Health Care               334622943
Industrials                51175869
Information Technology    414271709
Materials                  58452726
Real Estate                   47784
Utilities                     68316
dtype: int64
In [200]:
fig1, ax1 = plt.subplots(figsize=(10,10))
ax1.pie(rnd_sectors, labels=rnd_sectors.index, autopct='%1.1f%%', startangle=90, )
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()

Overall, the three sectors that spend the most on research and development are Communication Services, Information Technology, and Health Care. This makes sense because each of these sectors is highly dependent on innovation and research to produce new products. Consumer Staples and Financials spend remarkably little on research and development, likely because their products and services change little over time.

Do fundamental factors explain change in price?

Fit a Linear Model to predict AAPL's stock price using fundamental factors

In [282]:
frame = { 'close': close_data['AAPL'][:-2], 
         'ebidta': ebitda_data['AAPL'], 
         'eps': eps_data['AAPL'],
         'ev': ev_data['AAPL'],
         'gm': gm_data['AAPL'],
         'pe': pe_data['AAPL'],
         'rnd': rnd_data['AAPL'],
         'mc': mc_data['AAPL']
        } 
  
AAPL = pd.DataFrame(frame)
In [283]:
AAPL.head()
Out[283]:
close ebidta eps ev gm pe rnd mc
date
2019-04-01 47.810 2.730100e+10 1.055 8.491590e+11 0.38 13.111 3.902000e+09 7.792000e+11
2019-04-02 48.505 2.730100e+10 1.055 8.491590e+11 0.38 13.111 3.902000e+09 7.792000e+11
2019-04-03 48.837 2.730100e+10 1.055 8.491590e+11 0.38 13.111 3.902000e+09 7.792000e+11
2019-04-04 48.922 2.730100e+10 1.055 8.491590e+11 0.38 13.111 3.902000e+09 7.792000e+11
2019-04-05 49.250 2.730100e+10 1.055 8.491590e+11 0.38 13.111 3.902000e+09 7.792000e+11
In [284]:
#Fit a linear regression model using OLS in statsmodels (SciKit-Learn does not display summary statistics like p-value)

est = sm.OLS(AAPL['close'], AAPL[['ebidta', 'eps', 'ev', 'gm', 'pe', 'rnd', 'mc']])
est2 = est.fit()
print(est2.summary())     #Print summary statistics, including p-value
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  close   R-squared:                       0.778
Model:                            OLS   Adj. R-squared:                  0.775
Method:                 Least Squares   F-statistic:                     276.9
Date:                Sun, 20 Dec 2020   Prob (F-statistic):           3.69e-77
Time:                        23:31:03   Log-Likelihood:                -581.81
No. Observations:                 241   AIC:                             1172.
Df Residuals:                     237   BIC:                             1186.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
ebidta      9.067e-10   5.21e-11     17.406      0.000    8.04e-10    1.01e-09
eps        -4.799e-20   1.97e-20     -2.438      0.015   -8.68e-20   -9.22e-21
ev         -4.058e-10   2.13e-11    -19.079      0.000   -4.48e-10   -3.64e-10
gm          1.973e-19   4.11e-20      4.801      0.000    1.16e-19    2.78e-19
pe         -4.149e-20   1.14e-20     -3.626      0.000    -6.4e-20   -1.89e-20
rnd         3.147e-09   6.59e-10      4.779      0.000    1.85e-09    4.44e-09
mc          4.591e-10   2.23e-11     20.616      0.000    4.15e-10    5.03e-10
==============================================================================
Omnibus:                        1.062   Durbin-Watson:                   0.126
Prob(Omnibus):                  0.588   Jarque-Bera (JB):                0.977
Skew:                           0.156   Prob(JB):                        0.614
Kurtosis:                       2.993   Cond. No.                     2.66e+29
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 6.74e-33. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

The variance in the fundamental factors explains 77.8% of the variance in the closing price of AAPL, and every factor is significantly linearly associated with the closing price. However, the warning about the very large condition number (2.66e+29) indicates strong multicollinearity among the factors, so the individual coefficients and p-values should be interpreted with caution.
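Given the condition-number warning in the regression output, one common diagnostic (not used in the original analysis) is the variance inflation factor (VIF), which measures how well each regressor is explained by the others. A minimal sketch on synthetic data, where `x2` is deliberately made nearly collinear with `x1`:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic stand-in data: x2 is nearly collinear with x1, x3 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)
x3 = rng.normal(size=200)
X = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3})

# A VIF far above 10 for a regressor signals strong multicollinearity
vifs = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(dict(zip(X.columns, vifs)))
```

Running the same check on the fundamental-factor matrix would show which of the factors (e.g. enterprise value and market capitalization, which are closely related by definition) are driving the collinearity.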


Section 2:

What causes stock prices to change?

In short, just like any other commodity, stock prices are determined by the supply and demand of a company's stock$^{[14]}$

In an efficient market, the demand for a company's stock is balanced by the supply of the stock available. If a company exceeds expected performance for the fiscal term, then investors will demand more of the company's stock, and since the supply of the stock does not increase as quickly in response, the price of its stock goes up.

There are countless factors that affect the supply of and demand for a company's stock. However, they fall into three major categories$^{[15]}$:

Fundamental Factors:

* Fundamental factors drive stock prices based on a company's earnings and profitability from producing and selling goods and services.

* In an efficient market, stock prices would be determined primarily by fundamentals: i.e. "How well is the company performing?"

Technical Factors

* Technical factors relate to a stock's price history in the market pertaining to chart patterns, momentum, and behavioral factors of traders and investors.

* Technical factors are the mix of external conditions that alter the supply of and demand for a company's stock. Examples include inflation and economic strength of peers.

Overall Investor Sentiment

* Market sentiment is often subjective, biased, and obstinate. If investors dwell on a single piece of news or are "caught up in the hype", the stock's price becomes artificially high or low relative to its fundamentals.

Assumption: On a short time scale where fundamental factors remain constant (no new reports), stock prices are mostly influenced by their previous prices. Overall investor sentiment and short-term investor "reactions" to moving stock prices can lead to high volatility and inefficiency in the market. With the proper statistical tools, these inefficiencies can be exploited for profit.

Goal: If we can model the dependency of stock prices on their previous prices, we can use this to predict future movements in the market and invest accordingly.

Hypothesis: ARIMA is a good enough model to exploit inefficiencies in the market caused by investor sentiment and is a reasonable tool for predicting future stock performance.


What is ARIMA? (Autoregressive Integrated Moving Average)

ARIMA models aim to describe autocorrelations in time-series data.

ARIMA is a three-part model:

Autoregression (AR) refers to a model that shows a changing variable that regresses on its own lagged, or prior, values.

Integrated (I) represents the differencing of raw observations to allow for the time series to become stationary, i.e., data values are replaced by the difference between the data values and the previous values.

Moving average (MA) incorporates the dependency between an observation and a residual error from a moving average model applied to lagged observations.

In Essence: $$ x_t = \sum_{i=1}^{p} \alpha_i \, x_{t-i} + \sum_{j=1}^{q} \beta_j \, \epsilon_{t-j} + \epsilon_t $$

The value at time-step $t$ is a weighted sum of the values at time-steps $t-1$ through $t-p$, plus a weighted sum of the residual errors at time-steps $t-1$ through $t-q$, plus an error term.

A standard notation would be ARIMA(p,d,q)

p: the number of lag observations in the model; also known as the lag order.

d: the number of times that the raw observations are differenced; also known as the degree of differencing.

q: the size of the moving average window; also known as the order of the moving average.
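As a sanity check on the recursion above, here is a minimal pure-Python sketch that generates a series from given AR and MA coefficients. The coefficients and shocks are illustrative, not fitted values:

```python
def arma_series(alphas, betas, eps):
    """Generate x_t = sum_i alphas[i]*x_{t-1-i} + sum_j betas[j]*eps_{t-1-j} + eps_t,
    treating x and eps as zero before the series starts."""
    x = []
    for t, e in enumerate(eps):
        ar = sum(a * x[t - 1 - i] for i, a in enumerate(alphas) if t - 1 - i >= 0)
        ma = sum(b * eps[t - 1 - j] for j, b in enumerate(betas) if t - 1 - j >= 0)
        x.append(ar + ma + e)
    return x

# An AR(1) with alpha = 0.5: a single unit shock decays geometrically
print(arma_series([0.5], [], [1, 0, 0]))  # [1, 0.5, 0.25]
```

Note how a single shock keeps influencing future values through the autoregressive term, which is exactly the autocorrelation structure ARIMA is built to capture.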

ARIMA models can also account for seasonal time series data. Seasonal ARIMA models are denoted with ARIMA(p,d,q)(P,D,Q)[m]

m refers to the number of periods in each season.

P,D,Q refer to the autoregressive, differencing, and moving average terms for the seasonal part of the ARIMA model.

Note: to determine whether or not the time series should be differenced, an Augmented Dickey-Fuller test is performed. The null hypothesis is that a unit root exists in the time series; in other words, there is an underlying stochastic trend, a systematic yet unpredictable pattern. The alternative hypothesis is that the time series is stationary, an assumption of the ARIMA model.
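Differencing itself (the "I" in ARIMA) is simple; a minimal sketch showing a trending series flattened by repeated differencing, with illustrative values:

```python
def difference(series, d=1):
    """Apply d rounds of first differencing: y_t = x_t - x_{t-1}."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

trend = [1, 3, 6, 10, 15]      # quadratic trend, clearly non-stationary
print(difference(trend))        # [2, 3, 4, 5] -> still trending
print(difference(trend, d=2))   # [1, 1, 1]    -> constant, i.e. stationary
```

A series with a quadratic trend needs d = 2 before its mean stops drifting, which is exactly what the degree-of-differencing parameter encodes.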

Special Cases:

An ARIMA(0, 1, 0) represents a random walk -> $X_{t}=X_{t-1}+\epsilon _{t}$

An ARIMA(0, 0, 0) represents a white noise model

An ARIMA(1, 1, 2) model is a damped Holt's (damped-trend) model

An ARIMA(0, 1, 1) model is a basic exponential smoothing model -> $s_{t}=\alpha x_{t-1}+(1-\alpha)s_{t-1}$
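The exponential smoothing recursion above can be traced by hand; a short sketch with an illustrative $\alpha = 0.5$, seeding $s_0 = x_0$ (the seeding choice is an assumption, not part of the formula):

```python
def simple_exp_smoothing(xs, alpha):
    """s_t = alpha*x_{t-1} + (1-alpha)*s_{t-1}, seeded with s_0 = x_0."""
    s = [xs[0]]
    for t in range(1, len(xs)):
        s.append(alpha * xs[t - 1] + (1 - alpha) * s[-1])
    return s

print(simple_exp_smoothing([2, 4, 6], alpha=0.5))  # [2, 2.0, 3.0]
```

Each smoothed value is a geometrically-decaying weighted average of all past observations, which is why this special case coincides with an MA(1) model on the differenced series.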

How are ARIMA models selected?

Estimate out-of-sample prediction error with Akaike Information Criterion (AIC)

AIC estimates the relative amount of information lost by a given model: the less information a model loses, the higher the quality of that model. AIC balances goodness of fit (guarding against underfitting) against model simplicity (guarding against overfitting). A good model is one that minimizes AIC.
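AIC is simple to compute from a model's maximized log-likelihood $\ln\hat{L}$ and parameter count $k$: $AIC = 2k - 2\ln\hat{L}$. A small sketch with illustrative (made-up) numbers:

```python
def aic(k, log_likelihood):
    """Akaike Information Criterion: AIC = 2k - 2*ln(L-hat)."""
    return 2 * k - 2 * log_likelihood

# A model with one extra parameter must improve log-likelihood by more
# than 1 unit to achieve a lower (better) AIC.
simple_model = aic(k=3, log_likelihood=-120.0)    # 246.0
complex_model = aic(k=4, log_likelihood=-119.5)   # 247.0
print(simple_model, complex_model)
```

Here the extra parameter only bought 0.5 units of log-likelihood, so the simpler model wins; this penalty on parameter count is how a stepwise search like `auto_arima` avoids overfitting.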

ACF Plot

Example ACF

PACF Plot

Example PACF

While modeling, we don’t want to keep too many features which are correlated as that can create multicollinearity issues.

Use of ARIMA:

* In recent literature, ARIMA models are used as baseline performance measures to compare deep learning approaches like RNN LSTM.

* ARIMA predictions can also be used as engineered model parameters for Deep Learning models, and can be used in determining how "related" two Time-Series Objects are.

For more information on ARIMA, click here

In [29]:
def predict_auto_arima(data, target, columns, split, days_ahead, verbose = True):
    predicted_returns = {}
    datasets = []
    count = 1
    for col in columns:
        print(f'Processing ticker {count}')
        print(col)
        print('\n')
        X = data[target][col]
        X = X.dropna()
        differenced = X.diff()
        datasets.append(differenced)
        if verbose:
            
            #Plot the Time Series for close data
            plt.figure()
            plt.plot(X)
            plt.title(col)
            plt.ylabel(str(target))
            plt.xlabel('Date')
            plt.show()
        
            #Check whether the data is stationary with an Augmented Dickey-Fuller test
            print("Is the Data Stationary? (Augmented Dickey-Fuller Test)")
            adf_test = ADFTest(alpha = 0.05)
            dicky_res = adf_test.should_diff(X)
            print('p-value: '+str(dicky_res[0]))
            print('Differencing needed? '+str(dicky_res[1])) #should_diff returns True when the series is non-stationary
            print("\n")
    
        train, test = X[0:int(len(X)*split)], X[int(len(X)*split):] #Create train/test split
        test_set_range = test.index 
        train_set_range = train.index
        if verbose:
            #Plot the time series and denote the training period and the testing period
            plt.figure()
            plt.plot(train_set_range, train, label = 'train')
            plt.plot(test_set_range, test, label = 'test')
            plt.title(col)
            plt.xlabel('Date')
            plt.ylabel(str(target))
            plt.legend()
            plt.show()
    
            #Plot the ACF graph
            sm.graphics.tsa.plot_acf(X.values.squeeze(), lags=40)
            plt.show()
            
            #Plot the PACF graph
            sm.graphics.tsa.plot_pacf(X.values.squeeze(), lags=40)
            plt.show()
    
        #Automatically fit an ARIMA model to the time series
        trace = verbose #print the stepwise search only in verbose mode
        auto = pm.auto_arima(train, seasonal=False, stepwise=True,
                     suppress_warnings=True, error_action="ignore", max_p=10, max_q = 10, max_d = 10,
                     max_order=None, random_state = 9, n_fits = 50, trace=trace)
        if verbose:
            print(auto.summary())
    
                
        df = pd.DataFrame()
        df["date"] = pd.date_range(test.index[0], periods=days_ahead, freq="D")
        df = df.set_index(df["date"])
        
        
        prediction = pd.DataFrame(auto.predict(n_periods = days_ahead), index = df.index)
        prediction.columns = ['predicted_close']
        if verbose:
            #plot the n-day prediction against the testing period
            plt.figure(figsize=(10,10))
            plt.plot(train, label = "train")
            plt.plot(test, label="test")
            plt.plot(prediction, label = "predicted")
            plt.title(col)
            plt.xlabel('Date')
            plt.ylabel('Closing Price')
            plt.legend()
            plt.show()
    
        pred_price = prediction.iloc[-1][0]
        print(prediction)

        bought_price = train.iloc[-1]
        print(bought_price)
        
        print("Predicted Return:")
        pred_ret = (pred_price-bought_price)/bought_price        
        
        predicted_returns[col] = prediction
        print(pred_ret)
        print("\n")
        
        print("Actual Return:")
        try:
            print((test.iloc[days_ahead] - bought_price)/bought_price)
        except IndexError:
            print("Date for Prediction not in Testing Period")
        print("\n")
        
        #Plot a Rolling Prediction using ARIMA
        train_data, test_data = X[0:int(len(X)*split)], X[int(len(X)*split):]
        training_data = train_data.values
        test_data = test_data.values
        history = [x for x in training_data]
        model_predictions = []
        N_test_observations = len(test_data)
        for time_point in range(N_test_observations):
            model = ARIMA(history, order=(4,2,0))
            model_fit = model.fit(disp=2)
            output = model_fit.forecast()
            yhat = output[0]
            model_predictions.append(yhat)
            true_test_value = test_data[time_point]
            history.append(true_test_value)
        MSE_error = mean_squared_error(test_data, model_predictions)
        if verbose:
            print('Testing Mean Squared Error is {}'.format(MSE_error))
        
        test_set_range = X[int(len(X)*split):].index
        if verbose:
            plt.plot(test_set_range, model_predictions, color='blue',  linestyle='dashed',label='Predicted Price')
            plt.plot(test_set_range, test_data, color='red', label='Actual Price')
            plt.title(str(col) + ' Closing Price Prediction')
            plt.xlabel('Date')
            plt.ylabel('Prices')
            plt.legend()
            plt.show()
    
        count += 1

    return predicted_returns, datasets

For Walmart, JPMorgan Chase, and Berkshire Hathaway, predict closing price with ARIMA using a 70-30 train-test split.

For each stock, plot the ACF and PACF graphs, perform a Dickey-Fuller test for stationarity, train an ARIMA model with minimum AIC, and then perform two analyses:

1) Using the last day of the training set as a "purchase date," calculate the predicted percent return over the 20 days following the purchase. Compare this predicted return to the actual return over that period.

2) Predict the closing price over the testing period on a rolling day-by-day basis: predict the next day's closing price, then, once the actual closing price is known, refit the model and predict the following day. From this we can calculate the Mean Squared Error of the predictor.

In [ ]:
a, b = predict_auto_arima(equity_data, 'close', ['WMT', 'JPM','BRK.B'], 0.7, 20, verbose=True)
In [ ]:
### For Walmart:
    ARIMA(2,1,2) with AIC

Conclusion: By itself, ARIMA is a poor predictor of stock prices, but it has valuable uses.

So why is ARIMA alone a bad predictor?

* Uses only the closing price as a signal. This assumption greatly oversimplifies the factors that affect stock prices; in reality, previous prices and residuals alone are not enough

* Does not account for sudden changes or particularly volatile signals

* Does not adjust for earnings dates and changes in fundamental factors

* Predicting stock prices is notoriously difficult

So why is ARIMA relevant?

* Use it as an input to an ensemble model

* Use it to generate engineered features

* Use as a baseline comparison

* Use it to cluster stocks together based on their movement


Section 3: Clustering the S&P500

What is Dynamic Time Warping (DTW)?

DTW is an algorithm for measuring similarity between two time series, which may vary in speed

DTW has 4 constraints:

* Every index from the first sequence must be matched with one or more indices from the other sequence, and vice versa

* The first index from the first sequence must be matched with the first index from the other sequence (but it does not have to be its only match)

* The last index from the first sequence must be matched with the last index from the other sequence (but it does not have to be its only match)

* The mapping of the indices from the first sequence to indices from the other sequence must be monotonically increasing, and vice versa, i.e. if $j>i$ are indices from the first sequence, then there must not be two indices $l>k$ in the other sequence, such that index $i$ is matched with index $l$ and index $j$ is matched with index $k$, and vice versa

The optimal match is the one that satisfies all the constraints with minimal cost, where the cost is computed as the sum of the absolute differences between the values of each matched pair of indices. As such, the time series must be normalized (here using z-scores) to avoid biasing the measure toward higher-priced stocks.

Example DTW
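The cost described above can be computed with a classic dynamic program; a minimal pure-Python sketch (the notebook itself uses the `fastdtw` approximation, which is faster on long series):

```python
def dtw_distance(a, b):
    """Classic DTW: minimal total |a[i]-b[j]| over all monotonic alignments
    that match both start points and both end points."""
    n, m = len(a), len(b)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a[i] matched again
                                 cost[i][j - 1],      # b[j] matched again
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]

# The second series repeats a value ("varies in speed"); DTW still aligns perfectly.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # 0.0
print(dtw_distance([1, 2, 3], [1, 2, 4]))     # 1.0
```

The repeated-match rule is what lets DTW score two stocks as similar even when one makes the same move a few days later than the other.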

How do we use DTW to gain insights?

Using DTW cost as a distance metric between a set of time series, we can effectively perform clustering algorithms like Hierarchical Clustering and K-Means. We can also easily apply classification algorithms like K-Nearest Neighbors to classify a new stock.

Intuition for clustering stocks with DTW:

The price of stocks is also influenced greatly by the performance of other stocks in the market. If we can meaningfully cluster stocks together in such a way that we can determine whether or not a movement in one stock will likely affect the movement in another stock, or if a particular group of stocks tends to move in a similar way, we can build a model that "reacts" to sudden movements. Similarly, by meaningfully clustering stocks, we can create a diversified portfolio that is not too heavily invested in any particular "type" of stock.

"But aren't stocks already grouped by industry?"

Yes, and in most cases stocks within similar industries affect each others' movement the most. However, we don't want to restrict ourselves by industry alone because:

* Some industries themselves are highly correlated (Example: Financial Services and Consumer Discretionary)

* Not all companies in a sector necessarily influence each other or move similarly

* Stock prices for companies are often correlated with stock prices of companies in other sectors

* It is difficult to compare fundamental factors between companies in different sectors; "desired" ratios differ based on what the company does.

"So why use DTW?"

Clustering stocks together based on DTW allows us to group together stocks that follow similar movements. Doing so allows us to "respond" to sudden movements of a single stock within a cluster by confidently predicting that other stocks in the same cluster will have similar movements in the future. Similarly, this allows us to minimize portfolio risk by breaking down our portfolio by cluster so that movements in a single stock don't statistically affect the movements of other stocks in the entire portfolio. In essence, we can ensure that only a few stocks in our portfolio will follow similar movements in the market. Additional possible applications could be to split the universe by sector and then further filter each sector by Dynamic Time Warping Distance.

Goal: cluster stocks based on movement (momentum) using Dynamic Time Warping Distance

Hypothesis: DTW is an effective distance metric between stocks for clustering

For more information on Dynamic Time Warping, click here

In [45]:
#Create labels for each stock
labs = pd.DataFrame({'Names' : close_data.columns})

First, we calculate the pairwise Dynamic Time Warping Distance Between every pair of stocks in the S&P 500

In [48]:
#Since this is an incredibly costly algorithm, we cluster stocks based only on the last 4 days of the time period
latter = close_data.iloc[-4:]
In [49]:
#Compute pairwise Dynamic Time Warping Distance between each pair of stocks in the S&P500, store it as a "correlation matrix"
distance_matrix = []
count = 0
for x in latter.columns:
    count += 1
    distance_matrix.append([])
    for y in latter.columns:
        distance, path = fastdtw(stats.zscore(latter[x]), stats.zscore(latter[y]), dist=euclidean)
        distance_matrix[count-1].append(distance)
In [103]:
#Convert the Matrix into a Dataframe:
matrix = pd.DataFrame(distance_matrix, index = close_data.columns, columns = close_data.columns)
N = matrix.shape[0]
df = matrix

Plot the Correlation Matrix of stocks in the S&P500

In [108]:
def plot_corr(df):
    corr = df.corr()
    # Plot the correlation matrix
    fig, ax = plt.subplots(figsize=(20,20))
    sbn.heatmap(corr.to_numpy(), xticklabels = False, yticklabels = False, ax=ax, cmap="RdYlGn")
    plt.title('Dynamic Time-Warping Distance')
    fig.show()
In [107]:
plot_corr(df)

Perform a Hierarchical Clustering of the stocks in the matrix and replot the heatmap, rearranged by cluster

In [109]:
#Perform a hierarchical clustering of the matrix and replot the heatmap for the clusters found
X = df.corr().values
d = sch.distance.pdist(X)
L = sch.linkage(X, method='complete')
ind = sch.fcluster(L, 0.5*d.max(), 'distance')
columns = [df.columns.tolist()[i] for i in list((np.argsort(ind)))]
df = df.reindex(columns, axis=1)

plot_corr(df)

There appear to be 6 clusters within the S&P500 based on Dynamic Time Warping Distance (looking at dark green squares along the main diagonal). One of these clusters appears significantly larger than the others.

Show the Dendrogram for the Hierarchical Clustering

In [137]:
labels = df.index.to_numpy()
p = len(labels)

plt.figure(figsize=(30,30))
plt.title('Hierarchical Clustering Dendrogram (truncated)', fontsize=20)
plt.xlabel('Ticker', fontsize=16)
plt.ylabel('distance', fontsize=16)

# call dendrogram to get the returned dictionary 
# (plotting parameters can be ignored at this point)
R = dendrogram(
                L,
                truncate_mode='lastp',  # show only the last p merged clusters
                p=p,  # show only the last p merged clusters
                no_plot=False,
                )

# create a label dictionary
temp = {R["leaves"][ii]: labels[ii] for ii in range(len(R["leaves"]))}
def llf(xx):
    return "{}".format(temp[xx])

dendrogram(
            L,
            truncate_mode='lastp',  # show only the last p merged clusters
            p=p,  # show only the last p merged clusters
            leaf_label_func=llf,
            leaf_rotation=60.,
            leaf_font_size=7.,
            show_contracted=True,  # to get a distribution impression in truncated branches
            )
plt.show()

Cutting the dendrogram above at a given height yields a corresponding number of clusters. There appear to be three "main" clusters within the S&P500, represented here in orange, green, and red. Each of these clusters is further broken down into a series of subclusters.

Now, we can identify which stocks belong to which clusters and the properties within each cluster

In [139]:
def give_cluster_assigns(df, numclust, transpose=True):
    clusters = {}
    if transpose==True:
        data_dist = pdist(df.transpose())
        data_link = linkage(data_dist,  metric='correlation', method='complete')
        cluster_assigns=pd.Series(sch.fcluster(data_link, numclust, criterion='maxclust'), index=df.columns)
    else:
        data_dist = pdist(df)
        data_link = linkage(data_dist,  metric='correlation', method='complete')
        cluster_assigns=pd.Series(sch.fcluster(data_link, numclust, criterion='maxclust'), index=df.index)
    for i in range(1,numclust+1):
        print("Cluster ",str(i),": ( N =",len(cluster_assigns[cluster_assigns==i].index),")", ", ".join(list(cluster_assigns[cluster_assigns==i].index)))
        clusters[i] = list(cluster_assigns[cluster_assigns==i].index)
    return cluster_assigns, clusters

Let's analyze the sector composition of each of our 6 clusters

In [138]:
gg, hh = give_cluster_assigns(df, 6, True)
Cluster  1 : ( N = 50 ) FAST, DISCA, INTC, DLTR, DGX, INTU, JBHT, DOV, FBHS, DOW, HFC, NOC, FCX, EXPD, LEG, GWW, HES, BWA, SJM, BKR, TAP, TGT, TSCO, TXT, TYL, UNH, ANET, UNP, ALLE, ALL, VTR, ALB, AIZ, AIG, WHR, ADI, AAL, SHW, CAT, CMS, PPG, CE, COST, PH, PHM, CPRT, CDW, CTXS, CBRE, OKE
Cluster  2 : ( N = 52 ) CRM, ULTA, KHC, ATO, FOXA, TDG, FLT, LEN, BBY, EXR, EVRG, MCO, CBOE, CCI, MSCI, PVH, NDAQ, CL, NEE, PNW, CME, PNR, NLOK, PG, CSCO, DIS, DHI, D, DAL, GILD, UPS, DE, WMT, WMB, ICE, AKAM, IT, IDXX, HPQ, IBM, VIAC, IRM, GE, JNPR, IFF, SPGI, BA, ALK, WAB, HSY, CI, PKG
Cluster  3 : ( N = 11 ) BR, AEE, ABT, VRTX, CF, PM, CVS, NRG, DRI, NEM, SYF
Cluster  4 : ( N = 61 ) SEE, BSX, BK, SLG, BLK, SPG, CHD, PYPL, YUM, XRX, RSG, ARE, APH, APD, UDR, UNM, AMP, V, ADBE, AJG, STZ, ADM, ADSK, SYK, WLTW, WFC, AES, WDC, AFL, TJX, PSX, AIV, MA, EXPE, FFIV, MET, ISRG, EQIX, KMB, GPN, LIN, MOS, PBCT, CNC, EA, DTE, INCY, PRGO, NTAP, NOW, VZ, DXC, ED, ATVI, TIF, AMGN, LHX, VAR, KR, VNO, EXC
Cluster  5 : ( N = 112 ) RL, MRO, RE, WEC, QCOM, HIG, MHK, PSA, EFX, ROK, MTD, ROP, EMR, URI, POOL, HSIC, XRAY, DG, ORCL, XOM, HWM, MCHP, CSX, PEG, PLD, DVA, CMI, HOLX, WRB, CPB, MAS, LKQ, MAA, AMT, K, GOOGL, AOS, VRSN, KMI, KO, KSU, TER, AVY, TEL, LH, SWK, BF.B, AVB, BIIB, NWSA, FE, SO, FDX, DISH, ITW, IQV, NVR, HPE, NSC, KMX, GPS, MPC, GRMN, FLS, ZTS, SNA, BLL, APTV, VLO, ADP, WU, RHI, HUM, NLSN, INFO, PCAR, WRK, WY, WYNN, MCK, SBAC, LYB, LRCX, LMT, SWKS, PRU, MSFT, MTB, ODFL, A, EBAY, EOG, BEN, APA, EQR, DVN, BIO, CINF, CMG, CERN, CB, CTLT, CAG, C, CVX, DRE, FMC, GS, AEP, GD, AME, FTI
Cluster  6 : ( N = 209 ) VRSK, JCI, NI, VMC, MGM, NVDA, WBA, WELL, PAYX, OXY, ORLY, WST, IR, WAT, PPL, NCLH, MDLZ, RMD, MLM, MMC, SIVB, LVS, SLB, MRK, O, PWR, LOW, STE, SYY, T, MXIM, TWTR, UHS, ES, BAX, BAC, AZO, EMN, AWK, AVGO, ESS, ETN, ETR, ETSY, EW, FITB, DUK, CFG, HST, COG, COO, COP, CXO, DD, DHR, FLIR, CHRW, GPC, ABBV, ALGN, AAP, HCA, HD, GLW, HON, HLT, ACN, ABC, SBUX, SCHW, ABMD, SNPS, BRK.B, XEL, BKNG, BXP, BMY, VFC, RTX, PXD, ZBRA, QRVO, RCL, ZBH, REG, REGN, RF, XYL, CDNS, RJF, CCL, CAH, ROL, AAPL, SRE, XLNX, ROST, STX, STT, TSN, TT, TTWO, TXN, AMAT, UAA, UAL, AON, ANTM, ANSS, AMZN, USB, AMCR, AMD, TRV, TROW, BDX, WM, AXP, TDY, TFC, TFX, TMO, TMUS, TPR, ALXN, HRL, CNP, J, FB, LUMN, LUV, FANG, LW, F, LYV, MAR, MCD, MDT, MKC, MKTX, MMM, MNST, CHTR, LNT, LNC, FIS, GIS, KEY, KEYS, KIM, KLAC, JNJ, FTV, FTNT, JKHY, GM, FRT, L, LB, ZION, LDOS, IVZ, FISV, LLY, JPM, MO, MS, OMC, ILMN, CTVA, CTSH, PAYC, CTAS, PEAK, PEP, PFE, PFG, PGR, COF, PKI, IEX, PNC, HII, CMCSA, CMA, CLX, DFS, EL, NWL, NUE, EIX, HAL, MSI, HAS, ECL, MU, DXCM, DLR, IPG, IPGP, NKE, IP, NOV, HBAN, HBI, DPZ, NTRS, NFLX, GL

Here, N is the number of stocks in each cluster, and the tickers listed are those of the companies within each cluster.

The smallest cluster (cluster 3) has only 11 members. The largest cluster (cluster 6) has 209 members.

In [143]:
for j in range(1,7):
    print(f'Cluster #{j}')
    cluster1 = []
    cluster1_industry = []
    for i, x in gg.items():
        if x == j:
            cluster1.append(i)
    for x in cluster1:
        for a, b in stock_info.iterrows():
            if b['Ticker'] == x:
                cluster1_industry.append(stock_info.iloc[a]['Sector'])
    cluster1_industry = pd.Series(cluster1_industry)
    
    frequencies = cluster1_industry.groupby(cluster1_industry).size()
    fig1, ax1 = plt.subplots(figsize=(15, 15))
    ax1.pie(frequencies, labels=frequencies.index, autopct='%1.1f%%', startangle=90, )
    ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.

    plt.show()
Cluster #1
Cluster #2
Cluster #3
Cluster #4
Cluster #5
Cluster #6

Clusters 1, 2, and 5 have a higher percentage of Industrials represented. A large portion of Cluster 3 is Health Care, while clusters 4 and 6 are largely composed of Information Technology and Financials. It is interesting to note that each of the 6 clusters (besides cluster 3, the smallest) has member stocks from each of the 11 sectors. This indicates that stocks are highly correlated not only with other stocks in the same sector but also with stocks outside their sector. This is largely because of the inter-sector operations these companies conduct, with services being consumed and products being purchased among companies in different parts of the supply chain. Ultimately, although GICS Sector classification is one way to cluster stocks within the S&P500, not all stocks within the same sector necessarily follow the same "movement".


Section 4: Graph Analysis

Using the Dynamic Time Warping Distance Matrix Calculated above, we can represent the stocks in the S&P500 as a graph where vertices are stocks and edges exist depending on the Dynamic Time Warping Distance between two stocks

In [144]:
print('Average distance between stocks in the S&P500: ' + str(np.mean(distance_matrix)))
Average distance between stocks in the S&P500: 3.4065674152263563
In [145]:
print('Standard Deviation of distance between stocks in the S&P500: ' + str(np.std(distance_matrix)))
Standard Deviation of distance between stocks in the S&P500: 1.9409034716318543

Create a graph where an edge exists between two companies if the Dynamic Time Warping Distance is 1 or more standard deviations below the average Dynamic Time Warping Distance between all stocks

In [231]:
graph = np.where(distance_matrix <= np.mean(distance_matrix)-np.std(distance_matrix), 1, 0)
In [232]:
G = nx.from_numpy_matrix(np.array(graph))  
nx.draw(G, with_labels=True, labels=dict(labs['Names']))
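On a toy distance matrix, the edge rule above (an edge exists iff distance ≤ mean − one standard deviation) can be sketched with the same `np.where` thresholding; the 5×5 matrix here is purely illustrative:

```python
import numpy as np

# Hypothetical symmetric DTW-distance matrix: tickers 0 and 1 move almost
# identically (distance 0.5); every other pair is far apart (5.0).
dist = np.full((5, 5), 5.0)
np.fill_diagonal(dist, 0.0)
dist[0, 1] = dist[1, 0] = 0.5

# Edge iff distance is at least one standard deviation below the mean
threshold = dist.mean() - dist.std()
adjacency = np.where(dist <= threshold, 1, 0)
np.fill_diagonal(adjacency, 0)   # drop self-loops
print(int(adjacency.sum()) // 2)  # 1 -> exactly one edge: (0, 1)
```

Only the unusually-close pair survives the cut, so the resulting graph connects exactly those stocks whose recent movement is markedly more similar than average.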
In [ ]:
import itertools
k = 6
comp = nx.algorithms.community.centrality.girvan_newman(G)
for communities in itertools.islice(comp, k):
    print(tuple(sorted(c) for c in communities)) 
([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 71, 72, 73, 74, 75, 76, 77, 78, 79, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 96, 97, 98, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 125, 126, 127, 128, 129, 131, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 221, 222, 223, 224, 225, 226, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 295, 296, 297, 298, 299, 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 313, 314, 315, 316, 317, 318, 319, 322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 357, 358, 359, 360, 361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374, 375, 376, 377, 378, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389, 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 417, 418, 419, 420, 421, 422, 423, 424, 425, 426, 427, 428, 429, 430, 431, 432, 433, 434, 435, 436, 437, 438, 439, 440, 
441, 442, 443, 444, 445, 446, 447, 448, 449, 450, 451, 452, 453, 454, 455, 456, 457, 458, 459, 460, 461, 462, 463, 464, 465, 466, 467, 468, 469, 470, 471, 472, 473, 474, 475, 476, 478, 479, 480, 481, 482, 483, 484, 485, 486, 487, 488, 489, 490, 491, 492, 493, 494], [70, 80, 95, 99, 112, 113, 124, 130, 132, 194, 220, 227, 228, 229, 257, 294, 312, 320, 321, 356, 416, 477])
In [233]:
graph_vals = pd.DataFrame({'Load Centrality' : list(nx.algorithms.centrality.load_centrality(G).values()),
                           'Eigenvector Centrality' : list(nx.algorithms.centrality.eigenvector_centrality(G).values()),
                           'Closeness Centrality': list(nx.algorithms.centrality.closeness_centrality(G).values()),
                           'Betweenness Centrality': list(nx.algorithms.centrality.betweenness_centrality(G).values()),
                           'Degree Centrality': list(nx.algorithms.centrality.degree_centrality(G).values())}, index = list(nx.algorithms.centrality.degree_centrality(G).keys()))
graph_vals.head()
Out[233]:
Load Centrality Eigenvector Centrality Closeness Centrality Betweenness Centrality Degree Centrality
0 0.004202 0.029965 0.360584 0.004986 0.226721
1 0.008636 0.003043 0.282770 0.006274 0.182186
2 0.002949 0.075742 0.370037 0.003091 0.299595
3 0.002525 0.037207 0.305882 0.002454 0.259109
4 0.004158 0.077141 0.371708 0.003848 0.303644
In [234]:
graph_vals.set_index(labs['Names'], inplace=True)

Which companies have the highest closeness?

In [252]:
fig, ax = plt.subplots(figsize=(20,8))
top_centralities = graph_vals['Closeness Centrality'].sort_values(ascending=False)[0:15]
plt.bar(top_centralities.index, top_centralities)
plt.title('Stocks in S&P500 with highest Closeness Centrality')
plt.ylabel('Closeness Centrality')
plt.xlabel('Ticker')
fig.show()

Which companies have the highest betweenness?

In [249]:
fig, ax = plt.subplots(figsize=(20,8))
top_centralities = graph_vals['Betweenness Centrality'].sort_values(ascending=False)[0:15]
plt.bar(top_centralities.index, top_centralities)
plt.title('Stocks in S&P500 with highest Betweenness Centrality')
plt.ylabel('Betweenness Centrality')
plt.xlabel('Ticker')
fig.show()

Which companies have the highest Degree Centrality?

In [253]:
fig, ax = plt.subplots(figsize=(20,8))
top_centralities = graph_vals['Degree Centrality'].sort_values(ascending=False)[0:15]
plt.bar(top_centralities.index, top_centralities)
plt.title('Stocks in S&P500 with highest Degree Centrality')
plt.ylabel('Degree Centrality')
plt.xlabel('Ticker')
fig.show()

Which companies have the highest Eigenvector Centrality?

In [262]:
fig, ax = plt.subplots(figsize=(20,8))
top_centralities = graph_vals['Eigenvector Centrality'].sort_values(ascending=False)[0:70]
plt.bar(top_centralities.index, top_centralities)
plt.title('Stocks in S&P500 with highest Eigenvector Centrality')
plt.ylabel('Eigenvector Centrality')
plt.xlabel('Ticker')
fig.show()
In [ ]:
 

Scrape News Articles for sentiment data

In [89]:
# Import libraries
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer


finwiz_url = 'https://finviz.com/quote.ashx?t='
In [91]:
news_tables = {}
tickers2 = list(tickers) #copy so the original list of tickers is not mutated
tickers2.append('BRK-B')
tickers2.append('BF-B')
for ticker in tickers2:
    try:
        url = finwiz_url + ticker
        req = Request(url=url,headers={'user-agent': 'my-app/0.0.1'}) 
        response = urlopen(req)    
        # Read the contents of the file into 'html'
        html = BeautifulSoup(response)
        # Find 'news-table' in the Soup and load it into 'news_table'
        news_table = html.find(id='news-table')
        # Add the table to our dictionary
        news_tables[ticker] = news_table
    except:
        continue
In [92]:
# Read one single day of headlines for 'AMZN' 
amzn = news_tables['AMZN']
# Get all the table rows tagged in HTML with <tr> into 'amzn_tr'
amzn_tr = amzn.findAll('tr')

for i, table_row in enumerate(amzn_tr):
    # Read the text of the element 'a' into 'link_text'
    a_text = table_row.a.text
    # Read the text of the element 'td' into 'data_text'
    td_text = table_row.td.text
    # Print the contents of 'link_text' and 'data_text' 
    print(a_text)
    print(td_text)
    # Exit after printing 4 rows of data
    if i == 3:
        break
10 Best Stocks To Invest In Right Now According To Tech Billionaire
Dec-20-20 03:04PM  
The 1 Mega-Cap Tech Stock That Doesn't Fear Washington
12:09PM  
MercadoLibre's Profits Surged During the Pandemic. Here's Why.
11:00AM  
2 Stocks to Supplement Your Social Security Income
11:00AM  
In [93]:
parsed_news = []

# Iterate through the news
for file_name, news_table in news_tables.items():
    # Iterate through all tr tags in 'news_table'
    for x in news_table.findAll('tr'):
        # read the text from each tr tag into text
        # get text from a only
        text = x.a.get_text() 
        # splite text in the td tag into a list 
        date_scrape = x.td.text.split()
        # if the length of 'date_scrape' is 1, load 'time' as the only element

        if len(date_scrape) == 1:
            time = date_scrape[0]
            
        # else load 'date' as the 1st element and 'time' as the second    
        else:
            date = date_scrape[0]
            time = date_scrape[1]
        # Extract the ticker from the file name, get the string up to the 1st '_'  
        ticker = file_name.split('_')[0]
        
        # Append ticker, date, time and headline as a list to the 'parsed_news' list
        parsed_news.append([ticker, date, time, text])
        
#parsed_news
In [94]:
# Download the VADER lexicon used by the sentiment intensity analyzer
import nltk
nltk.download('vader_lexicon')
In [ ]:
# Instantiate the VADER sentiment intensity analyzer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
vader = SentimentIntensityAnalyzer()

# Set column names
columns = ['ticker', 'date', 'time', 'headline']

# Convert the parsed_news list into a DataFrame called 'parsed_and_scored_news'
parsed_and_scored_news = pd.DataFrame(parsed_news, columns=columns)

# Iterate through the headlines and get the polarity scores using vader
scores = parsed_and_scored_news['headline'].apply(vader.polarity_scores).tolist()

# Convert the 'scores' list of dicts into a DataFrame
scores_df = pd.DataFrame(scores)

# Join the DataFrames of the news and the list of dicts
parsed_and_scored_news = parsed_and_scored_news.join(scores_df, rsuffix='_right')

# Convert the date column from string to datetime
parsed_and_scored_news['date'] = pd.to_datetime(parsed_and_scored_news.date).dt.date

parsed_and_scored_news.head()
In [ ]:
stocks = []
sentiments = []
dd = parsed_and_scored_news.groupby(['ticker'])
for key, value in dd:
    if key in gg.index:
        stocks.append(key)
        sentiments.append(value['compound'].mean())
In [ ]:
sentiment_info = pd.DataFrame({'Ticker': stocks, 'Sentiment': sentiments}) #Create Dataframe
In [ ]:
sentiment_info['Sentiment'].mean()
In [ ]:
stock_info
In [ ]:
average_sentiment = {}
for i, x in stock_info.iterrows():
    for j, y in sentiment_info.iterrows():
        if y['Ticker'] == x['Ticker']:
            # Accumulate sentiment per sector, starting at 0 for unseen sectors
            sector = x['Sector']
            average_sentiment[sector] = average_sentiment.get(sector, 0) + y['Sentiment']
In [ ]:
for i, x in stock_info.groupby(['Sector']).count()['Name'].items():
    average_sentiment[i] /= x
average_sentiment
In [ ]:
average_sentiment = {}
for i, x in gg.items():
    for j, y in sentiment_info.iterrows():
        if y['Ticker'] == i:
            # Accumulate sentiment per cluster, starting at 0 for unseen clusters
            average_sentiment[x] = average_sentiment.get(x, 0) + y['Sentiment']
In [ ]:
for x in average_sentiment.keys():
    average_sentiment[x] /= len(hh[x])
average_sentiment

Final Analysis:

Let's compare the performance of an S&P 500 portfolio against two alternatives: a portfolio of the top 3 performers in each S&P 500 sector, and a portfolio of the top 3 performers in each Dynamic Time Warping distance cluster.
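The shape of that comparison can be sketched as follows. This is a minimal illustration, not the notebook's actual analysis: the tickers and the `prices` frame are synthetic placeholders standing in for the real S&P 500 close-price data, and both portfolios are equal-weighted.

```python
import numpy as np
import pandas as pd

# Synthetic daily close prices (geometric random walk) for 5 hypothetical tickers
rng = np.random.default_rng(0)
dates = pd.date_range('2020-01-01', periods=252, freq='B')
tickers = ['AAA', 'BBB', 'CCC', 'DDD', 'EEE']
prices = pd.DataFrame(
    100 * np.exp(np.cumsum(rng.normal(0.0005, 0.01, (252, 5)), axis=0)),
    index=dates, columns=tickers)

def cumulative_return(close: pd.DataFrame) -> float:
    """Total return of an equal-weight portfolio over the whole period."""
    daily = close.pct_change().mean(axis=1)         # equal-weight daily return
    return float((1 + daily.fillna(0)).prod() - 1)  # compound into a total return

benchmark = cumulative_return(prices)                    # stand-in for the full index
top3 = cumulative_return(prices[['AAA', 'BBB', 'CCC']])  # stand-in for a top-3 pick
print(f'benchmark: {benchmark:.2%}, top-3 portfolio: {top3:.2%}')
```

With the real data, `prices` would be sliced from `equity_data['close']` and the ticker lists would come from the sector and cluster rankings computed above.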


Conclusion:


Why is this Important?

1. Retirement
2. Capital Allocation
3. Better Savings than Keeping Money in the Bank
4. By increasing returns and minimizing risk through diversification, we make investing a practical option for everyday savers, not just professionals.
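Point 3 can be made concrete with a rough compound-growth comparison. The numbers below are illustrative assumptions, not forecasts: roughly 7% per year for an index fund (in line with the long-run averages cited in the introduction) versus 0.5% per year for a typical savings account.

```python
def future_value(principal: float, annual_rate: float, years: int) -> float:
    """Value of a lump sum after compounding annually at a fixed rate."""
    return principal * (1 + annual_rate) ** years

# $10,000 over 30 years under the two assumed rates
index_fund = future_value(10_000, 0.07, 30)   # ~7%/yr, assumed index average
savings = future_value(10_000, 0.005, 30)     # 0.5%/yr, assumed savings rate
print(f'index fund: ${index_fund:,.0f} vs savings account: ${savings:,.0f}')
```

Under these assumptions the index fund grows to roughly $76,000 while the savings account reaches only about $11,600.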

Further Work:

Potential further work for this project would be to include more fundamental factors for each company, as well as to incorporate economic data about the general state of the economy. I would also like to go deeper into using these factors, along with the features extracted from the graph representation and the sentiment analysis, to perform a regression on stock returns and build a better predictive model. Additionally, I would like to explore more regression algorithms and evaluate the performance of a recurrent neural network (LSTM) model. In the future, I would also like to look into methods of clustering stocks beyond hierarchical clustering on Dynamic Time Warping distance.
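For reference, the DTW distance underlying that clustering can be computed with textbook dynamic programming. This is a minimal illustration, not the distance routine used earlier in the notebook, and it is O(nm), so a windowed or lower-bounded variant would be needed at scale.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic Time Warping distance between two 1-D series (O(len(a)*len(b)))."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible warping steps
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

# A phase-shifted copy of a series scores closer under DTW than pointwise distance
s1 = np.sin(np.linspace(0, 2 * np.pi, 50))
s2 = np.sin(np.linspace(0, 2 * np.pi, 50) + 0.5)
print(dtw_distance(s1, s1))  # → 0.0
print(dtw_distance(s1, s2), np.abs(s1 - s2).sum())
```

Because the warping path can stretch and compress time, DTW pairs up similarly shaped price moves even when they are out of phase, which is exactly why it suits clustering stock series.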

Video Walkthrough:

In [60]:
from IPython.display import HTML
HTML('<iframe width="600" height="200" src="https://www.youtube.com/embed/rfscVS0vtbw" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')
Out[60]:
In [99]:
# Daily returns of MSFT with 10-, 30-, and 100-day rolling means
msft_returns = equity_data['close']['MSFT'].pct_change()
plt.plot(msft_returns)
plt.plot(msft_returns.rolling(10).mean())
plt.plot(msft_returns.rolling(30).mean())
plt.plot(msft_returns.rolling(100).mean())
Out[99]:
[<matplotlib.lines.Line2D at 0x7fae27b983d0>]
In [95]:
plt.plot(equity_data['close']['MSFT'])
plt.plot(equity_data['close']['MSFT'].rolling(10).mean())
plt.plot(equity_data['close']['MSFT'].rolling(30).mean())
plt.plot(equity_data['close']['MSFT'].rolling(100).mean())
plt.show()
In [59]:
# Two-pass clustering:
# 1) Cluster the correlation matrix and sort the columns by cluster assignment
# 2) For any cluster bigger than the threshold, cluster that sub-cluster again
#    and sort its columns by the finer assignment



cluster_th = 1  # re-cluster any first-pass cluster with more than this many members

X = df.values
d = sch.distance.pdist(X)
L = sch.linkage(d, method='complete')  # same pairwise distances as 'd' above
ind = sch.fcluster(L, 0.5*d.max(), 'distance')

columns = [df.columns.tolist()[i] for i in list(np.argsort(ind))]
df = df.reindex(columns, axis=1)

unique, counts = np.unique(ind, return_counts=True)
counts = dict(zip(unique, counts))
print(ind)
print(unique)
print(counts)

i = 0
j = 0
columns = []
for cluster_l1 in set(sorted(ind)):
    j += counts[cluster_l1]
    sub = df[df.columns.values[i:j]]
    if counts[cluster_l1]>cluster_th:        
        X = sub.corr().values
        d = sch.distance.pdist(X)
        L = sch.linkage(d, method='complete')
        ind = sch.fcluster(L, 0.5*d.max(), 'distance')
        col = [sub.columns.tolist()[k] for k in np.argsort(ind)]
        sub = sub.reindex(col, axis=1)
    cols = sub.columns.tolist()
    columns.extend(cols)
    i = j
df = df.reindex(columns, axis=1)

plot_corr(df, N)
[6 1 8 7 8 8 7 3 7 4 1 4 5 4 3 6 4 4 1 4 1 4 2 1 8 2 1 1 8 7 7 8 6 4 4 6 8
 1 7 7 8 6 6 4 4 5 4 2 4 5 8 6 8 7 8 2 8 8 2 7 6 5 6 6 4 7 1 4 5 8 3 7 4 1
 8 6 6 7 1 6 2 1 2 7 7 1 1 6 3 8 4 8 8 2 6 2 8 8 7 2 6 5 1 4 8 7 8 8 8 1 5
 1 2 2 5 8 6 7 7 1 3 6 8 2 2 8 2 8 5 1 2 8 2 1 5 7 1 1 1 7 6 3 4 8 6 6 4 7
 4 6 7 4 5 8 8 8 5 6 4 6 8 8 8 8 8 2 8 4 1 4 2 7 8 1 7 1 1 5 6 4 7 8 8 8 5
 2 6 2 7 6 7 8 6 2 2 8 7 8 8 5 8 4 5 5 6 1 8 7 7 8 8 8 1 1 5 7 8 5 8 5 2 8
 5 8 2 5 5 2 2 2 7 2 7 4 6 1 1 7 8 8 5 8 2 4 2 5 8 7 1 8 8 7 2 8 6 7 7 2 7
 7 4 6 5 6 4 5 7 8 7 1 2 6 4 4 6 7 6 8 8 8 6 7 8 8 7 6 8 4 5 8 5 8 6 6 2 8
 8 4 8 5 7 8 8 8 8 8 7 4 5 8 6 8 2 6 7 6 6 7 8 8 2 2 3 7 8 8 2 6 1 8 4 3 5
 4 7 7 8 5 7 6 8 6 1 8 6 8 8 7 8 4 6 8 5 7 7 7 2 7 1 1 2 8 6 3 7 2 2 5 1 8
 4 6 5 4 2 8 8 4 6 8 8 6 7 7 8 5 8 6 8 6 7 5 7 4 8 6 7 8 4 1 8 1 8 4 5 7 6
 4 2 7 8 7 8 4 5 6 3 4 8 8 1 2 8 6 6 7 7 1 4 4 8 8 7 7 8 1 7 8 7 8 8 1 1 7
 8 4 8 2 1 4 1 2 5 8 4 4 7 2 5 8 4 8 6 3 1 4 2 8 8 4 6 8 4 1 4 7 2 2 5 6 8
 5 6 6 7 7 5 6 4 7 4 8 8 8 5]
[1 2 3 4 5 6 7 8]
{1: 50, 2: 52, 3: 11, 4: 61, 5: 46, 6: 66, 7: 79, 8: 130}